ABSTRACT

Matching dependencies (MDs) have been recently introduced 
as declarative rules for entity resolution (ER), i.e. for
identifying and resolving duplicates in a relational instance D. A
set of MDs can be used as the basis for a possibly non- 
deterministic mechanism that computes a duplicate-free in- 
stance from D. The possible results of this process are the 
clean, minimally resolved instances (MRIs). There might be 
several MRIs for D, and the resolved answers to a query are 
those that are shared by all the MRIs. We investigate the 
problem of computing resolved answers. We look at various 
sets of MDs, developing syntactic criteria for determining 
(in)tractability of the resolved answer problem, including a 
dichotomy result. For some tractable classes of MDs and 
conjunctive queries, we present a query rewriting methodol- 
ogy that can be used to retrieve the resolved answers. We 
also investigate connections with consistent query answer- 
ing, deriving further tractability results for MD-based ER. 

1. INTRODUCTION 

For different reasons, databases may contain different co- 
existing representations of the same external, real world en- 
tity. Those duplicates can be entire tuples or values within 
them. Ideally, those tuples or values should be merged into 
a single representation. Identifying and merging duplicates 
is a process called entity resolution (ER) [11, 14]. Matching 
dependencies (MDs) are a recent proposal for declarative 
duplicate resolution [15, 16]. An MD expresses, in the form 
of a rule, that if the values of certain attributes in a pair 
of tuples are similar, then the values of other attributes in 
those tuples should be matched (or merged) into a common 
value. 

For example, the MD $R_1[X_1] \approx R_2[X_2] \rightarrow R_1[Y_1] \doteq R_2[Y_2]$
says that if an $R_1$-tuple and an $R_2$-tuple have similar values
for attributes $X_1, X_2$, then their values for $Y_1, Y_2$ should
be made equal. This is a dynamic dependency, in the sense 
that its satisfaction is checked against a pair of instances: 
the first where the antecedent holds and the second where 
the identification of values takes place. This semantics of
MDs was sketched in [16]. 
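
As a rough illustration of the antecedent of an MD, the following Python sketch lists the pairs of tuples that trigger the rule, i.e. the pairs whose values the consequent asks to be matched. The relation contents, the attribute names, the helper names `similar` and `triggered_pairs`, and the use of case-insensitive equality as the similarity relation are all assumptions made for this example; they are not taken from [15, 16].

```python
from itertools import product


def similar(a, b):
    # Assumed similarity relation: case-insensitive equality.
    # It is symmetric and reflexive, as required of similarity relations.
    return a.lower() == b.lower()


def triggered_pairs(r1_tuples, r2_tuples, lhs, sim=similar):
    """Return the id pairs whose tuples are similar on every LHS attribute pair.

    r1_tuples, r2_tuples: dict mapping tuple id -> dict attribute -> value.
    lhs: list of (attribute of R1, attribute of R2) pairs to compare.
    """
    return [
        (id1, id2)
        for (id1, t1), (id2, t2) in product(r1_tuples.items(), r2_tuples.items())
        if all(sim(t1[x1], t2[x2]) for x1, x2 in lhs)
    ]


# Hypothetical data: one pair of tuples is similar on X1/X2 but differs on Y1/Y2.
R1 = {
    "t1": {"X1": "J. Smith", "Y1": "Ottawa"},
    "t2": {"X1": "A. Lee", "Y1": "Toronto"},
}
R2 = {"s1": {"X2": "j. smith", "Y2": "Otawa"}}

pairs = triggered_pairs(R1, R2, lhs=[("X1", "X2")])
```

For the pair returned here, the MD only states that the `Y1` and `Y2` values have to be made equal; which common value is used is not fixed by the rule.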

The original semantics was refined in [10], including the 
use of matching functions to do the matching of two at- 
tribute values. Furthermore, the minimality of changes (due 
to the matchings) is guaranteed by means of a chase-like pro- 
cedure that changes values only when strictly needed. 

An alternative refinement of the original semantics was 
proposed in [21], which is the basis for this work. In this 
case, arbitrary values can be used for the matching. The 
semantics is also based on a chase-like procedure. However, 
the minimality of the number of changes is explicitly im- 
posed. In more detail, in order to obtain a clean instance, 
an iterative procedure is applied, in which the MDs are ap- 
plied repeatedly. At each step, merging of duplicates can 
generate additional similarities between values, which forces 
the MDs to be applied again and again, until a clean instance 
is reached. Although MDs indicate values to be merged, the 
clean instance obtained by applying this iterative process to 
a dirty instance will in general depend on how the merging 
is done, and MDs do not specify this. As expected, MDs can 
be applied in different orders. As a consequence, alternative 
clean instances can be obtained. They are defined in [21] as 
the minimally resolved instances (MRIs). 

Since there might be large portions of data that are not 
affected by the occurrence of duplicates or by the entity 
resolution process, no matter how it is applied, it becomes 
relevant to characterize and obtain those pieces of data that 
are invariant under the cleaning process. They could be, 
in particular, answers to queries. The resolved answers [21] 
to a query posed to the original, dirty database are those 
answers to the query that are invariant under the entity res- 
olution process. In principle, the resolved answers could be 
obtained by computing all the MRIs, and posing the query 
to all of them, identifying later the shared answers. This 
may be too costly, and more efficient alternatives should be 
used whenever possible, e.g. a mechanism that uses only the 
original, dirty instance. 

In [21], the problem of computing resolved answers to a 
query was introduced, and some preliminary and isolated 
complexity results were given. In this work we largely extend 
those results on resolved query answering, providing new 
complexity results, in Sections 3 and 5. For tractable cases, 
and for the first time, a query rewriting methodology for 
efficiently retrieving the resolved answers is presented, in 
Section 4. 

Summarizing, in this paper we undertake the first systematic
investigation of the complexity of the problems of computing
and deciding resolved answers to conjunctive queries.
More precisely, the contributions of this paper are as follows:

1. Starting with the simplest cases of MDs and queries, 
we consider the complexity of computing the resolved 
answers. We provide syntactic characterizations of 
easy and hard cases of MDs. 

2. For certain sets of two MDs, we establish a dichotomy 
result, proving that deciding the resolved answers is in 
PTIME or NP-hard in data.

3. We then move on to larger sets of MDs, establishing, in 
particular, tractability for some interesting cyclic sets 
of MDs. 

4. We consider the problem of retrieving the resolved an- 
swers to a query by querying the original dirty database 
instance. For tractable classes of MDs, and a class of 
first-order conjunctive queries, we show that a query 
can be rewritten into a new query that, posed to the 
original dirty instance, returns the resolved answers to 
the original query. Although the rewritten query is 
not necessarily first-order, it can be expressed in pos- 
itive Datalog with recursion and counting, which can 
be evaluated in polynomial time. 

5. We establish a connection between MRIs and database 
repairs under key constraints as found in consistent 
query answering (CQA) [3, 7, 12]. In CQA, the repair 
semantics is usually based on deletion of whole tuples, 
and minimality on comparison under set inclusion. Re- 
ductions from/to CQA allow us to profit from results 
for CQA, obtaining additional (in)tractability results 
for resolved query computation under MDs. 

These intractability results are important in that they 
show that our query rewriting methodology in 4. does 
not apply to all conjunctive queries. On the other 
hand, the tractable cases identified via CQA differ 
from those in item 4.: The class of MDs is more re- 
strictive, but the class of conjunctive queries is larger. 

Our complexity analysis sheds some initial light on the in- 
trinsic computational limitations of retrieving the informa- 
tion from a dirty database that is invariant under entity 
resolution processes, as captured by MDs. 

The structure of the paper is as follows. Section 2 in- 
troduces notation used in the paper and reviews necessary 
material from previous publications. Section 3 investigates 
the complexity of the problem of computing resolved an- 
swers, identifying various tractable and intractable cases. In 
Section 4, an efficient query rewriting methodology for
obtaining the resolved answers (in tractable cases) is described.
Section 5 establishes the connection with CQA. In Section 
6 we draw some final conclusions. 

2. PRELIMINARIES 

We consider a relational schema $\mathcal{S}$ that includes an
enumerable, possibly infinite domain $U$, and a set $\mathcal{R}$ of database
predicates. $\mathcal{S}$ determines a first-order (FO) language $L(\mathcal{S})$.
An instance $D$ of $\mathcal{S}$ is a finite set of ground atoms of the form
$R(\bar{t})$, with $R \in \mathcal{R}$, say of arity $n$, and $\bar{t} \in U^n$. $R(D)$ denotes
the extension of $R$ in $D$. The set of all attributes of $R$ is
denoted by $attr(R)$. We sometimes refer to attribute $A$ of $R$
by $R[A]$. We assume that all the attributes are different, and
that we can identify attributes with positions in predicates,
e.g. $R[i]$, with $1 \le i \le n$. If the $i$th attribute of predicate
$R$ is $A$, for a tuple $t = (c_1, \ldots, c_n) \in R(D)$, $t_R^D[A]$ (usually,
simply $t_R[A]$ or $t[A]$ if the instance is understood) denotes
the value $c_i$. The symbol $t[\bar{A}]$ denotes the tuple whose
entries are the values of the attributes in $\bar{A}$. Attributes have
subdomains, which they may share, that are contained in $U$.

In order to compare instances, obtained from the same
instance through changes of attribute values, we use tuple
identifiers: Each database tuple $R(c_1, \ldots, c_n) \in D$ has an
identifier, say $t$, making the tuple implicitly become
$R(t, c_1, \ldots, c_n)$. The $t$ value is taken by an additional attribute,
say $T$, that acts as a key. Identifiers are not subject to
updates, and are usually left implicit. Sometimes we do not
distinguish between a tuple and its tuple identifier. That is,
with now $t$ a tuple identifier (value), $t_R^D$ denotes the tuple
$R(c_1, \ldots, c_n)$ above; and $t_R^D[A_i]$, the value for attribute $A_i$,
i.e. $c_i$ above.^1 Two instances over the same schema that
share the same tuple identifiers are said to be correlated.
In this case it is possible to unambiguously compare their
tuples.

A matching dependency (MD) [15], involving predicates
$R(A_1, \ldots, A_n)$, $S(B_1, \ldots, B_m)$, is a rule, $m$, of the form

$m:\ \bigwedge_{i,j} R[A_i] \approx_{ij} S[B_j] \ \rightarrow\ \bigwedge_{k,l} R[A_k] \doteq S[B_l].$ (1)

The set of attributes on the left-hand side (LHS) of $m$ (wrt
the arrow) is denoted with $LHS(m)$. Similarly, $RHS(m)$ denotes
the set of attributes on the right-hand side (RHS). The
domain-dependent binary relations $\approx_{ij}$ denote
similarity of attribute values from a shared domain.
The symbol $\doteq$ means that the values of the pair of attributes
in $t_1$ and $t_2$ should be updated to the same value. In
consequence, the intended semantics of the MD is that if any
pair of tuples, $t_1 \in R(D)$ and $t_2 \in S(D)$, satisfy the
similarity conditions on the LHS, then for the same tuples the
attributes indicated on the RHS have to take the same
values [16].^2 The similarity relations, generically denoted with
$\approx$, are symmetric and reflexive. We assume that all sets $M$
of MDs are in standard form, i.e. for no two different MDs
$m_1, m_2 \in M$, $LHS(m_1) = LHS(m_2)$. All sets of MDs can
be put in this form.

For abbreviation, we will sometimes write MDs as

$R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{E}],$ (2)

with $\bar{A} = (A_1, \ldots, A_k)$, $\bar{B} = (B_1, \ldots, B_k)$, $\bar{C} = (C_1, \ldots, C_{k'})$,
and $\bar{E} = (E_1, \ldots, E_{k'})$ lists of attributes. The pairs $(A_i, B_i)$
and $(C_i, E_i)$ are called corresponding pairs of attributes in
$(\bar{A}, \bar{B})$ and $(\bar{C}, \bar{E})$, resp. For an instance $D$ and a pair of
tuples $t_1 \in R(D)$ and $t_2 \in S(D)$, $t_1[\bar{A}] \approx t_2[\bar{B}]$ indicates
that the similarities of the values for all corresponding pairs
of attributes of $(\bar{A}, \bar{B})$ hold. The notation $t_1[\bar{C}] = t_2[\bar{E}]$ is
used similarly.

Definition 1. [21] For a set $M$ of MDs, the MD-graph,
$MDG(M)$, is a directed graph with a vertex $m$ for each
$m \in M$, and with an edge from $m_1$ to $m_2$ iff
$RHS(m_1) \cap LHS(m_2) \neq \emptyset$. □

^1 If there is no danger of confusion, we sometimes omit
$D$ or $R$ from $t_R^D$, $t_R^D[A]$.

^2 We assume that instances and MDs share the same schema.



MD-graphs can have self-loops. A set of MDs whose MD-graph
contains edges is called interacting. Otherwise, it is
called non-interacting.
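
The MD-graph and the interacting/non-interacting distinction can be sketched in a few lines of Python. The encoding of an MD as a pair of attribute sets, the attribute naming scheme (`"R.A"` for attribute A of R), and the function names `md_graph` and `is_interacting` are assumptions of this sketch, as is the two-MD set used to exercise it.

```python
def md_graph(mds):
    """Edges of the MD-graph: (m1, m2) whenever RHS(m1) and LHS(m2) intersect.

    mds: dict mapping an MD name -> (set of LHS attributes, set of RHS attributes).
    """
    return {
        (m1, m2)
        for m1, (_, rhs1) in mds.items()
        for m2, (lhs2, _) in mds.items()
        if rhs1 & lhs2
    }


def is_interacting(mds):
    # A set of MDs is interacting iff its MD-graph has at least one edge.
    return bool(md_graph(mds))


# Hypothetical set: the RHS attributes of m1 are exactly the LHS attributes of m2.
M = {
    "m1": ({"R.A", "S.E"}, {"R.B", "S.F"}),
    "m2": ({"R.B", "S.F"}, {"R.C", "S.G"}),
}
edges = md_graph(M)

# A single MD whose RHS attributes do not occur in its LHS has no self-loop.
single = {"m": ({"R.A", "S.E"}, {"R.B", "S.F"})}
```

Because the comparison also pairs each MD with itself, self-loops are detected by the same intersection test.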

Updates as prescribed by an MD are not arbitrary. The 
allowed updates are the matching of values when the precon- 
ditions are met, which is captured by the set of modifiable 
values. 

Definition 2. Let $D$ be an instance, $R \in \mathcal{R}$, $t_R \in R(D)$,
$C$ an attribute of $R$, and $M$ a set of MDs. Value $t_R^D[C]$ is
modifiable if there exist $S \in \mathcal{R}$, $t_S \in S(D)$, an $m \in M$ of
the form $R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{E}]$, and a corresponding
pair $(C, E)$ of $(\bar{C}, \bar{E})$, such that one of the following holds:

1. $t_R[\bar{A}] \approx t_S[\bar{B}]$, but $t_R[C] \neq t_S[E]$.

2. $t_R[\bar{A}] \approx t_S[\bar{B}]$ and $t_S[E]$ is modifiable.

Value $t_R^D[C]$ is potentially modifiable if $t_R[\bar{A}] \approx t_S[\bar{B}]$ holds.
For a list of attributes $\bar{C}$, $t_R^D[\bar{C}]$ is (potentially) modifiable
iff there is a $C$ in $\bar{C}$ such that $t_R^D[C]$ is (potentially) modifiable. □
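
Since the second condition of Definition 2 refers back to modifiability itself, the set of modifiable values is naturally computed as a least fixpoint. The sketch below is one way to do that for two relations; the data layout, the function name `modifiable_values`, the symmetric treatment of the S-side values, and the toy instance are assumptions of this example rather than part of the definition.

```python
def modifiable_values(R, S, mds, sim):
    """Least fixpoint of the two conditions of Definition 2.

    R, S: dict mapping tuple id -> dict mapping attribute -> value.
    mds:  list of (lhs_pairs, rhs_pairs); each pair is (R attribute, S attribute).
    sim:  similarity predicate on values (assumed symmetric and reflexive).
    Returns a set of (relation name, tuple id, attribute) triples.
    """
    modifiable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in mds:
            for i, tr in R.items():
                for j, ts in S.items():
                    if not all(sim(tr[a], ts[b]) for a, b in lhs):
                        continue  # the antecedent fails for this pair of tuples
                    for c, e in rhs:
                        r_key, s_key = ("R", i, c), ("S", j, e)
                        differ = tr[c] != ts[e]  # condition 1
                        for key, other in ((r_key, s_key), (s_key, r_key)):
                            # condition 2: the value to be matched with is modifiable
                            if key not in modifiable and (differ or other in modifiable):
                                modifiable.add(key)
                                changed = True
    return modifiable


# Hypothetical instance with the single MD R[A] ~ S[B] -> R[C] = S[E],
# using plain equality as the similarity relation.
R = {"r1": {"A": "a", "C": "x"}, "r2": {"A": "b", "C": "y"}}
S = {"s1": {"B": "a", "E": "z"}, "s2": {"B": "b", "E": "y"}}
mds = [([("A", "B")], [("C", "E")])]

result = modifiable_values(R, S, mds, sim=lambda u, v: u == v)
```

In this instance only the pair `r1`, `s1` satisfies the antecedent with different values on the matched attributes, so only those two values are modifiable; `r2` and `s2` already agree and are left untouched.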

Definition 3. [21] Let $D$, $D'$ be correlated instances, and
$M$ a set of MDs. $(D, D')$ satisfies $M$, denoted $(D, D') \models M$,
iff:

1. For any pair of tuples $t_R \in R(D)$, $t_S \in S(D)$, if
there exists an $m \in M$ of the form $R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{E}]$
and $t_R[\bar{A}] \approx t_S[\bar{B}]$, then for the corresponding tuples
$t'_R \in R(D')$ and $t'_S \in S(D')$, it holds $t'_R[\bar{C}] = t'_S[\bar{E}]$.

2. For any tuple $t_R \in R(D)$ and any attribute $G$ of $R$, if
$t_R[G]$ is non-modifiable, then $t'_R[G] = t_R[G]$. □

This definition of MD satisfaction departs from [16], which 
requires that updates preserve similarities. Similarity preser- 
vation may force undesirable changes [21]. The existence of 
the updated instance D' for D is guaranteed [21]. Further- 
more, wrt [16], our definition does not allow unnecessary 
changes from D to D'. Definitions 2 and 3 require that
only values of attributes that appear on the RHS of the arrow
in some MD are subject to updates. This motivates the 
following definition. 

Definition 4. For a set $M$ of MDs defined on schema $\mathcal{S}$,
the changeable attributes of $\mathcal{S}$ are those that appear to the
right of the arrow in some $m \in M$. The other attributes of
$\mathcal{S}$ are called unchangeable. □

Definition 3 allows us to define a clean instance wrt M as the
result of a sequence of updates, each step being satisfaction 
preserving, leading to a stable instance [16]. 

Definition 5. [21] A resolved instance for $D$ wrt $M$ is
an instance $D'$, such that there is a sequence of instances
$D_1, D_2, \ldots, D_n$ with: $(D, D_1) \models M$, $(D_1, D_2) \models M$, ...,
$(D_{n-1}, D_n) \models M$, $(D_n, D') \models M$, and $(D', D') \models M$.
($D'$ is stable.) □



Example 1. Consider the MD $R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$
on predicate $R$, and an instance $D$.
It has several resolved instances, among them, four
that minimize the number of changes. One of them is
$D_1$. A resolved instance that is not minimal in this sense is $D_2$.

As suggested by the previous example, we will require that
the number of changes wrt instance $D$ is minimized.

Definition 6. For an instance $D$ of schema $\mathcal{S}$,

(a) $T_D := \{(t, A) \mid t$ is the id of a tuple in $D$ and $A$ is an
attribute of the tuple$\}$.

(b) $f_D : T_D \rightarrow U$ is given by: $f_D(t, A) :=$ the value for $A$
in the tuple in $D$ with id $t$.

(c) For an instance $D'$ with the same tuple ids as $D$,
$S_{D,D'} := \{(t, A) \in T_D \mid f_D(t, A) \neq f_{D'}(t, A)\}$. □

Definition 7. [21] A minimally resolved instance (MRI)
of $D$ wrt $M$ is a resolved instance $D'$ such that $|S_{D,D'}|$
is minimum, i.e. there is no resolved instance $D''$ with
$|S_{D,D''}| < |S_{D,D'}|$. □

Example 2. (Example 1 continued) It holds
$S_{D,D_1} = \{(t_2, B), (t_4, B)\}$; and
$S_{D,D_2} = \{(t_2, B), (t_3, B), (t_4, B)\}$. Furthermore,
$|S_{D,D_1}| < |S_{D,D_2}|$. □
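
The set of changed positions between two correlated instances is straightforward to compute. In the following Python sketch the function name `change_set`, the dictionary encoding of instances, and all concrete attribute values are assumptions; the values were chosen only so that the resulting change sets coincide with the two sets stated in Example 2.

```python
def change_set(D, D_prime):
    """Positions (tuple id, attribute) where two correlated instances differ.

    D, D_prime: dict mapping tuple id -> dict mapping attribute -> value,
    assumed to share the same tuple ids and attributes.
    """
    return {
        (t, a)
        for t, row in D.items()
        for a, value in row.items()
        if D_prime[t][a] != value
    }


# Hypothetical instances (values are illustrative assumptions).
D = {
    "t1": {"A": "a1", "B": "c1"},
    "t2": {"A": "a1", "B": "c2"},
    "t3": {"A": "b1", "B": "c3"},
    "t4": {"A": "b1", "B": "c4"},
}
D1 = {
    "t1": {"A": "a1", "B": "c1"},
    "t2": {"A": "a1", "B": "c1"},
    "t3": {"A": "b1", "B": "c3"},
    "t4": {"A": "b1", "B": "c3"},
}
D2 = {t: {"A": row["A"], "B": "c1"} for t, row in D.items()}

s1 = change_set(D, D1)
s2 = change_set(D, D2)
```

Comparing `len(s1)` with `len(s2)` is then the cardinality comparison used to single out minimally resolved instances.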

The MRIs are the intended clean instances obtained after 
the application of a set of MDs to an initial instance D. 
There is always an MRI for an instance D wrt M [21]. The 
clean or resolved answers to a query are certain for the class 
of MRIs for D wrt M. They are the intrinsically clean an- 
swers to the query. 

Definition 8. [21] Let $Q(\bar{x})$ be a query expressed in the
first-order language $L(\mathcal{S})$ associated to schema $\mathcal{S}$ of an
instance $D$. A tuple of constants $\bar{a}$ from $U$ is a resolved
answer to $Q(\bar{x})$ wrt the set $M$ of MDs, denoted $D \models_M Q[\bar{a}]$,
iff $D' \models Q[\bar{a}]$, for every MRI $D'$ of $D$ wrt $M$. We denote
with $ResAn(D, Q, M)$ the set of resolved answers to $Q$ from
$D$ wrt $M$. □
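
Read operationally, the definition of resolved answers is an intersection of ordinary query answers over all MRIs. The sketch below shows only that last step and assumes the MRIs have already been computed by some other means; the function name `resolved_answers`, the encoding of instances, the projection query, and the two MRIs are hypothetical.

```python
def resolved_answers(mris, query):
    """Answers returned by `query` on every instance in `mris`.

    mris:  non-empty list of instances (any representation `query` understands).
    query: function mapping an instance to an iterable of answer tuples.
    """
    answer_sets = [set(query(instance)) for instance in mris]
    return set.intersection(*answer_sets) if answer_sets else set()


def project_b(instance):
    # Hypothetical unary query: the projection of R on attribute B.
    return {(row["B"],) for row in instance.values()}


# Two hypothetical MRIs that disagree on the value chosen for t1 and t2.
mri_1 = {
    "t1": {"A": "a1", "B": "c1"},
    "t2": {"A": "a1", "B": "c1"},
    "t3": {"A": "b1", "B": "c3"},
}
mri_2 = {
    "t1": {"A": "a1", "B": "c2"},
    "t2": {"A": "a1", "B": "c2"},
    "t3": {"A": "b1", "B": "c3"},
}

answers = resolved_answers([mri_1, mri_2], project_b)
```

Only the value that survives in both instances is returned. This naive approach requires materializing every MRI, which is exactly the cost that a rewriting over the original instance is meant to avoid.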

3. ON THE COMPLEXITY OF RAP 

Notice that the number of MRIs can be exponential in the 
size of the instance, as the next example shows. 

Example 3. (Example 1 continued) The example can be
generalized to an instance with $2n$ tuples, which has $2^n$ MRIs. □



Checking the possibly exponentially many MRIs for an in- 
stance to obtain resolved answers is inefficient. We need 
more efficient algorithms. However, this aspiration will be 
limited by the intrinsic complexity of the problem. In this 
work we investigate the complexity of computing resolved 
answers to queries. We concentrate on the resolved answer 
problem (RAP), about deciding if a tuple is a resolved an- 
swer. 

Definition 9. For a query $Q(\bar{x}) \in L(\mathcal{S})$, and $M$, the
resolved answer problem is deciding membership of the set:

$RA_{Q,M} := \{(D, \bar{a}) \mid \bar{a} \in ResAn(D, Q, M)\}$. □



A different decision problem, closely related to RAP, was 
shown to be intractable when there is more than one MD 
[21]. This is because new similarities can arise between val- 
ues as a result of a particular choice of update values (rather 
than because the values were identified as duplicates and 
merged). Such similarities are called accidental similarities 
[21]. As we will see, this dependence of updates on the 
choice of update values for previous updates may make RAP 
intractable. 

Example 4. (Example 1 continued) For instance $D_2$, a
similarity for attribute $B$ is "accidentally" created for tuples
$t_2, t_3$. □

Since duplicate resolution involves modifying individual val- 
ues, an important problem is to decide which of these values 
are the same in all MRIs. It is obviously related to the RAP 
problem, and sheds light on its complexity. More precisely, 
for a fixed predicate R, and A an attribute of R in position 
$i$, we consider the unary query

$Q^{R.A}(x_i):\ \exists x_1 \cdots x_{i-1} x_{i+1} \cdots x_n\, R(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n),$ (3)

i.e. the projection of $R$ on $A$; and a special case of RAP:

$RA_M^{R.A} := \{(D, a) \mid a \in ResAn(D, Q^{R.A}(x_i), M)\}.$ (4)

Intractability of simple single-projected atomic queries like
(3), i.e. of $RA_M^{R.A}$, restricts the general efficient
applicability of duplicate resolution. On the other hand, we will
show (cf. Sections 4, 5) that, for important classes of
conjunctive queries and for sets of MDs such that $RA_M^{R.A}$ can
be efficiently solved for all $R$ and $A$, the resolved answers to
queries in the class can be efficiently computed. For this
reason, we concentrate on the following classification of MDs.

Definition 10. A set $M$ of MDs is hard if, for some predicate
$R$ and some attribute $A$ of $R$, $RA_M^{R.A}$ is NP-hard (in
data). $M$ is easy if, for each $R$ and $A$, $RA_M^{R.A}$ can be solved
in polynomial time.^3 □

In the next subsections, we develop syntactic criteria on 
MDs for easiness/hardness (cf. Theorems 1, 2, and Defi- 
nition 16). Some of these complexity results will be gener- 
alized in Section 4 to larger classes of conjunctive queries. 

3.1 Acyclic MDs and a dichotomy result

Non-interacting (NI) sets of MDs (cf. Section 2) are easy, 
due to the simple form of the MRIs, each of which can be 
obtained with a single update. So, sets of duplicate values 
can be identified simply by comparing pairs of tuples in the 
given instance, to see if they satisfy the similarity relations. 
The minimality condition implies that each such set of dupli- 
cate values must be updated to (one of) the most frequently 
occurring value(s) among them. The simplest non-trivial 
case is a linear pair of two MDs. 
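
The minimality argument for non-interacting sets can be illustrated on a single set of duplicate values. The sketch assumes that such a set has already been identified by comparing tuples under the similarity relations; the function names `most_frequent_values` and `minimum_changes` and the sample values are hypothetical.

```python
from collections import Counter


def most_frequent_values(duplicates):
    """Values that occur most often in a list of duplicate values."""
    counts = Counter(duplicates)
    top = max(counts.values())
    return {value for value, count in counts.items() if count == top}


def minimum_changes(duplicates):
    # Every occurrence that differs from the chosen common value must change,
    # so the fewest changes are obtained by keeping a most frequent value.
    return len(duplicates) - max(Counter(duplicates).values())


group = ["c1", "c2", "c1", "c3"]   # one clear winner
tie = ["c1", "c2"]                 # two equally good choices

winner = most_frequent_values(group)
choices = most_frequent_values(tie)
```

When several values are tied, as in the second list, each of them yields a different instance with the same number of changes, which is how a non-interacting set can still have more than one MRI.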

Definition 11. A linear pair $M$ of MDs is such that
$MDG(M)$ consists of the vertices $m_1$ and $m_2$ with an edge
from $m_1$ to $m_2$. The linear pair is denoted by $(m_1, m_2)$. □

The case of linear pairs is non-trivial in the sense that it
can be hard (cf. Theorem 2). In this section, we show that
tractability for linear pairs occurs when the form of the MDs
is such that it prevents accidental similarities generated in
one update from affecting subsequent updates (cf. Theorem
1). Deciding whether or not a linear pair has this form
is straightforward. Although all results of this section are
stated for MDs involving two distinct predicates, they can
easily be extended to the case of a single relation.^4

^3 The problem used here to define hard/easy is slightly
different from, and more appropriate than, the one used in [21].
Here hardness refers to Turing reductions.

Example 5. Consider the following linear pair $(m_1, m_2)$ of
MDs and instance:

$R[A] \approx S[E] \rightarrow R[B] \doteq S[F]$,
$R[B] \approx S[F] \rightarrow R[C] \doteq S[G]$.

Different instances can be produced with a single update,
depending on the choice of common value. Two of those
instances consist of relations $R', S'$ and $R'', S''$, respectively.
These two updates lead to different sets of tuples with duplicate
values for the $R[C]$ and $S[G]$ attributes to be matched:
$\{t_1, t_2\}$ and $\{t_3, t_4\}$ in the case of $R'$, and $\{t_1, t_2, t_3, t_4\}$ in
the case of $R''$. In general, the effect of the choice of update
values for the $R[B]$ and $S[F]$ attributes on subsequent updates
for the $R[C]$ and $S[G]$ attributes leads to intractability.
Actually, this linear pair will turn out to be hard (cf. below).

However, an easy set of MDs can be obtained by introducing
the similarity condition of $m_1$ into $m_2$:

$m_1: R[A] \approx S[E] \rightarrow R[B] \doteq S[F]$,
$m_2': R[A] \approx S[E] \wedge R[B] \approx S[F] \rightarrow R[C] \doteq S[G]$.

The accidental similarity between, for example, $t_2[F]$ in $S''$
and $t_3[B]$ in $R''$ cannot affect the update on the $R[C]$ and
$S[G]$ attribute values of these tuples, because the $S[E]$
attribute value of $t_2$ and the $R[A]$ attribute value of $t_3$ are
dissimilar. In effect, the conjunct $R[A] \approx S[E]$ "filters out"
the accidental similarities generated by application of $m_1$,
preventing them from affecting the update on the $R[C]$ and
$S[G]$ attribute values. □

In general, any linear pair $(m_1, m_2)$ for which the similarity
condition of $m_1$ is included in that of $m_2$ is easy [21].
Although linear pairs $(m_1, m_2)$ are, in general, hard, the
previous example shows that they can be easy if all attributes
in $LHS(m_1)$ also occur in $LHS(m_2)$. We now generalize this
result showing that, when all similarity operators are
transitive, a linear pair can be easy iff a subset of the attributes
of $LHS(m_1)$ are in $LHS(m_2)$.

Transitivity is not necessarily assumed for a similarity
relation. In consequence, it deserves a discussion. Transitivity
in this case requires that two dissimilar values cannot be
similar to the same value. This imposes a restriction on
accidental similarities, as the next example shows, extending
the set of tractable cases.

^4 This is done by treating the relation as two different
relations with identical tuples and attributes. For example, the
condition $S[A] \approx S[B]$ is interpreted as $S_L[A_L] \approx S_R[B_R]$.
All complexity results go through with minor modifications.

Example 6. Consider the pair $M$, and instance $D$, only
part of which is shown below. The only similarities are:
$e \approx a$ and $e \approx i$. So, $\approx$ is non-transitive.

$m_1: R[A] \approx S[E] \wedge R[B] \approx S[F] \rightarrow R[C] \doteq S[G]$
$m_2: R[A] \approx S[G] \wedge R[C] \approx S[G] \wedge R[C] \approx S[E] \rightarrow R[H] \doteq S[I]$

The first MD requires an update of each pair in the set
$\{(t_l[C], t_{l+1}[G]) \mid 1 \le l \le 7,\ l \text{ odd}\}$ to a common value.
If $e$ is chosen as this value for all pairs, then all pairs of
tuples, one from $R$ and one from $S$, would satisfy the
similarity condition of $m_2$, causing the values of $t[H]$ to be
updated to a common value for all tuples in $R$. However, if
in the initial update $a$ is chosen as the update value for
$(t_1[C], t_2[G])$ and $(t_3[C], t_4[G])$, and $i$ is chosen as the
update value for $(t_5[C], t_6[G])$ and $(t_7[C], t_8[G])$, then the value
of $\{t_1[H], t_3[H]\}$ and that of $\{t_5[H], t_7[H]\}$ will be updated
independently of each other. If $\approx$ were transitive, this would
always be the case, leaving fewer possibilities for updates.
□

Most similarity relations used in ER are not transitive [14]. 
While this restricts the applicability of the tractability re- 
sults presented in this subsection, they could still be ap- 
plied in situations where the non-transitive similarity rela- 
tions satisfy transitivity to a good approximation, for the 
specific instance at hand. 

Consider Example 6, assuming string-valued attributes, 
and $\approx$ defined as the property of being within a certain
edit distance, which is not transitive. Accidental similarities, 
such as the one in Example 6, may arise in general. However, 
one could expect the edit distance between duplicate values 
within the R[A] column to be very small relative to that 
between non-duplicate values. This would be the case if 
errors were small within those columns. In such a case, the 
edit distance threshold could be chosen so that the duplicate 
values would be clustered into groups of mutually similar 
values, with a large edit distance between any two values 
from different groups. 
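
The failure of transitivity for a thresholded edit distance is easy to reproduce. The sketch below uses the standard dynamic-programming computation of the Levenshtein distance; the threshold of 1 and the three sample strings are assumptions chosen for illustration, with the middle string playing the role of the value that is similar to two mutually dissimilar values.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def similar(s, t, threshold=1):
    # Assumed similarity relation: edit distance within a fixed threshold.
    return edit_distance(s, t) <= threshold


x, y, z = "cat", "cot", "dot"
```

Here `y` is within distance 1 of both `x` and `z`, while `x` and `z` are at distance 2, so the relation is symmetric and reflexive but not transitive. If the data contained only well-separated clusters of values, no such middle value would exist and the relation would behave transitively on that instance.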

In Example 6, if $a$ and $i$ are dissimilar, the pair of
similarities $e \approx a$ and $e \approx i$ that led to the accidental similarities
when $e$ was chosen as the update value would be unlikely
to occur. Since such accidental similarities, which are
precluded when $\approx$ is transitive, are rare in this case, they would
affect only a few tuples in the instance. In consequence, a
good approximation to the resolved answers would be
obtained by applying a polynomial time algorithm that returns
the resolved answers under the assumption that $\approx$ is
transitive. In this paper we do not investigate this direction any
further. The easiness results (but not the hardness results) 
presented in this section require the assumption of transitiv- 
ity of all similarity operators. They do not hold in general 
for non-transitive similarity relations. 



Definition 12. Let $m$ be an MD. The symmetric binary
relation $LRel_m$ ($RRel_m$) relates each pair of attributes $A$
and $B$ such that a conjunct of the form $R[A] \approx S[B]$ ($R[A] \doteq S[B]$)
appears in $LHS(m)$ ($RHS(m)$). An L-component (R-component)
of $m$ is an equivalence class of the reflexive, transitive
closure, $LRel_m^{*}$ ($RRel_m^{*}$), of $LRel_m$ ($RRel_m$). □

Lemma 1. A linear pair $(m_1, m_2)$ of MDs, with $\approx_1$ and
$\approx_2$ transitive, and $R$, $S$ distinct relations,

$m_1: R[\bar{A}] \approx_1 S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{E}]$

$m_2: R[\bar{F}] \approx_2 S[\bar{G}] \rightarrow R[\bar{H}] \doteq S[\bar{I}]$

is easy if the following holds: If an attribute of $R$ ($S$) in
$RHS(m_1)$ occurs in $LHS(m_2)$, then for each L-component
$L$ of $m_1$, there is an attribute of $R$ ($S$) from $L$ that belongs
to $LHS(m_2)$. □

Example 7. Assuming that $\approx$ is transitive, the following
linear pair of MDs:

$m_1: R[A] \approx S[B] \wedge R[C] \approx S[B] \wedge R[E] \approx S[F] \rightarrow R[G] \doteq S[H]$,

$m_2: R[G] \approx S[H] \wedge R[A] \approx S[B] \wedge R[E] \approx S[F] \rightarrow R[I] \doteq S[J]$

is easy, because Lemma 1 applies. Here, the L-components
of $m_1$ are $\{R[A], R[C], S[B]\}$ and $\{R[E], S[F]\}$, and
$LHS(m_2)$ includes both an attribute of $R$ and an attribute
of $S$ from each of these L-components. □
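
The syntactic condition of Lemma 1 can be checked mechanically. The following Python sketch computes L-components as connected components of the LHS conjuncts and then tests the condition for both relations. The encoding of conjuncts as attribute pairs such as `("R.A", "S.B")`, the function names `merge_overlapping` and `lemma1_condition`, and the reading of the condition as membership in $LHS(m_2)$ (the reading Example 7 uses) are assumptions of the sketch. It checks neither that the pair is linear nor that the similarity relations are transitive.

```python
def merge_overlapping(groups):
    """Merge groups that share an element; returns pairwise disjoint sets."""
    merged = []
    for group in groups:
        group = set(group)
        rest = []
        for existing in merged:
            if existing & group:
                group |= existing
            else:
                rest.append(existing)
        merged = rest + [group]
    return merged


def lemma1_condition(m1_lhs, m1_rhs, m2_lhs):
    """True iff the sufficient condition of Lemma 1 holds for (m1, m2)."""
    lhs2 = {attr for pair in m2_lhs for attr in pair}
    rhs1 = {attr for pair in m1_rhs for attr in pair}
    l_components = merge_overlapping(m1_lhs)
    for relation in ("R.", "S."):
        if not any(attr.startswith(relation) for attr in rhs1 & lhs2):
            continue  # nothing to check for this relation
        for component in l_components:
            if not any(a.startswith(relation) and a in lhs2 for a in component):
                return False
    return True


# The linear pair of Example 7.
ex7 = dict(
    m1_lhs=[("R.A", "S.B"), ("R.C", "S.B"), ("R.E", "S.F")],
    m1_rhs=[("R.G", "S.H")],
    m2_lhs=[("R.G", "S.H"), ("R.A", "S.B"), ("R.E", "S.F")],
)
# The first linear pair of Example 5, whose second MD repeats no LHS attribute of m1.
ex5 = dict(
    m1_lhs=[("R.A", "S.E")],
    m1_rhs=[("R.B", "S.F")],
    m2_lhs=[("R.B", "S.F")],
)

holds_ex7 = lemma1_condition(**ex7)
holds_ex5 = lemma1_condition(**ex5)
```

A negative result only means that this particular sufficient condition does not apply; it does not by itself establish hardness.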

Lemma 1 generalizes the idea of Example 5, where with
$(m_1, m_2')$, accidental similarities are "filtered out" and
cannot affect updates. In some cases, a linear pair of MDs can
be easy despite the presence of accidental similarities which
can affect subsequent updates. This happens when an
attribute must take on a specific value in order to affect further
updates. Definitions 13 and 14 syntactically capture this
intuition. $TC(r)$ denotes the transitive closure of a binary
relation $r$.

Definition 13. Let $(m_1, m_2)$ be a linear pair of MDs of
the form

$m_1: R[\bar{A}] \approx_1 S[\bar{C}] \rightarrow R[\bar{E}] \doteq S[\bar{F}]$
$m_2: R[\bar{G}] \approx_2 S[\bar{H}] \rightarrow R[\bar{I}] \doteq S[\bar{J}]$

(a) For predicate $R$, $B_R$ is a binary relation on attributes of
$R$: For attributes $R[A_1]$ and $R[A_2]$, $B_R(R[A_1], R[A_2])$ holds
iff $R[A_1]$ and $R[A_2]$ are in the same R-component of $m_1$
or the same L-component of $m_2$. Relation $B_S$ is defined
analogously for predicate $S$.

(b) An equivalent set (ES) of attributes of $(m_1, m_2)$ is an
equivalence class of $TC(B_R)$ or of $TC(B_S)$, with at least one
attribute in the equivalence class belonging to $LHS(m_2)$. □

Notice that relations $B_R$ and $B_S$ are reflexive and symmetric
binary relations on attributes in $RHS(m_1) \cup LHS(m_2)$.

Example 8. Consider the following linear pair of MDs on
relations $R[A, C, E, G, H]$ and $S[B, D, F, I]$:

$R[A] \approx S[B] \rightarrow R[C] \doteq S[D] \wedge R[E] \doteq S[D]$
$R[E] \approx S[F] \wedge R[G] \approx S[F] \rightarrow R[H] \doteq S[I]$

The attributes of $R$ satisfy the relations $B_R(R[C], R[E])$
(due to $R[C] \doteq S[D]$ and $R[E] \doteq S[D]$) and $B_R(R[E], R[G])$
(due to $R[E] \approx S[F]$ and $R[G] \approx S[F]$). Relation $B_S$ is
empty, since there is only one attribute of $S$ in each of
$RHS(m_1)$ and $LHS(m_2)$. There is one non-singleton ES,
$\{R[C], R[E], R[G]\}$, and also the singleton ES $\{S[F]\}$. □



An ES is a natural unit that groups together the attributes
of a linear pair with transitive similarities, because of the
close association between the update values for them. For
a linear pair as in Definition 13, the set of values which a
tuple $t$ in relation $R$ takes on the attributes within an
R-component of $m_1$ must be modified to the same value if
any of the values is modifiable. Also, by transitivity, the
attributes of $t$ in $RHS(m_2)$ are not modifiable by $m_2$ unless
the values taken by $t$ on the attributes in an L-component
of $m_2$ are similar (cf. Example 9 below). Therefore, when
considering updates that affect the values of attributes in
$RHS(m_2)$, the values for a given tuple of attributes within
an ES of attributes can be assumed to be similar.

Example 9. (Example 6 continued) We illustrate the
association between values of attributes in an ES, and also how
the presence of an ES of a certain form can simplify updates.

With the given instance and set $M$ of MDs, we now
assume that $\approx$ is transitive. $M$ has the ES $\{R[A], R[C]\}$. For
any tuple $t$ of $R$, the value of $t[A]$ must be similar to that of
$t[C]$ in order for there to be a tuple $t'$ in $S$ such that $t$ and $t'$
satisfy the similarity condition of $m_2$. This is because they
must both be similar to the value of $t'[G]$, and then must
be similar to each other by transitivity. If there is no such
tuple $t'$, then by Definition 2, $t[H]$ is not modifiable, and by
Definition 3, the value of $t[H]$ does not change.

$M$ does not satisfy the condition of Lemma 1. Here, unlike
the pairs for which Lemma 1 holds, the application of the MDs
can result in accidental similarities between pairs of
modifiable values in $R$ that can affect further updates. This is
because only $R[A]$, not both $R[A]$ and $R[B]$, is in $LHS(m_2)$
(cf. Lemma 1). For example, when $m_1$ is applied to the
instance, if both the pair $t_1[C]$ and $t_2[G]$, and the pair $t_3[C]$
and $t_4[G]$ are updated to $a$, there will be an accidental
similarity between $t_1[C]$ and $t_3[C]$, forcing $t_1[H]$ and
$t_3[H]$ to be updated to a common value.

Despite these accidental similarities, updates are made
simpler by the fact that the ES contains $R[A]$, an attribute
in $LHS(m_1)$. All sets of tuples in $R$ whose values for $R[C]$ are
matched must have the same value for $R[A]$. After these
values are merged, regardless of the common value chosen,
either all tuples in the set will have their $R[H]$ values changed,
or none of them will change. This would not be true in
general if there were no attribute of $LHS(m_1)$ in the ES. In that
case, there could be many possible outcomes depending on
the value chosen for a set of duplicate values of $R[C]$. □

Example 9 shows how, for a linear pair $(m_1, m_2)$, the
presence of an attribute of $LHS(m_1)$ in an ES can simplify
updates. This motivates the next definition.

Definition 14. Let $(m_1, m_2)$ be a linear pair of MDs on
relations $R$ and $S$. An ES $E$ of $(m_1, m_2)$ is bound if
$E \cap LHS(m_1)$ is non-empty. □

Example 10. Consider the following linear pair of MDs
defined on $R[A, C, F, H, I, M]$ and $S[B, D, E, G, N]$:

$R[A] \approx S[B] \rightarrow R[C] \doteq S[D] \wedge R[C] \doteq S[E] \wedge R[F] \doteq S[G] \wedge R[H] \doteq S[G]$,
$R[F] \approx S[E] \wedge R[I] \approx S[E] \wedge R[A] \approx S[E] \wedge R[F] \approx S[B] \rightarrow R[M] \doteq S[N]$.

The ES $\{S[D], S[E], S[B]\}$ is bound, because it contains
$S[B]$. The ES $\{R[A], R[F], R[I], R[H]\}$ is bound, because
it contains $R[A]$. □
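
Equivalent sets and the bound property are also purely syntactic, so they can be computed from the conjuncts of the two MDs. The Python sketch below does this for the linear pair of Example 10. The encoding of conjuncts as attribute pairs, the function names `merge_overlapping`, `equivalent_sets` and `is_bound`, and the reading of the first MD's right-hand side with attribute `R.C` are assumptions of the sketch; an ES is taken to be a class containing at least one attribute of $LHS(m_2)$, which is the reading consistent with Example 8.

```python
def merge_overlapping(groups):
    """Merge groups that share an element; returns pairwise disjoint sets."""
    merged = []
    for group in groups:
        group = set(group)
        rest = []
        for existing in merged:
            if existing & group:
                group |= existing
            else:
                rest.append(existing)
        merged = rest + [group]
    return merged


def equivalent_sets(m1_rhs, m2_lhs, relation):
    """ESs of (m1, m2) made of attributes of `relation` ("R" or "S")."""
    prefix = relation + "."
    lhs2 = {attr for pair in m2_lhs for attr in pair}
    # R-components of m1 followed by L-components of m2.
    components = merge_overlapping(m1_rhs) + merge_overlapping(m2_lhs)
    # Restrict each component to the attributes of the chosen relation.
    groups = [{a for a in comp if a.startswith(prefix)} for comp in components]
    classes = merge_overlapping(g for g in groups if g)
    # Keep only the classes that contain an attribute of LHS(m2).
    return [c for c in classes if c & lhs2]


def is_bound(es, m1_lhs):
    return bool(es & {attr for pair in m1_lhs for attr in pair})


# Linear pair of Example 10.
m1_lhs = [("R.A", "S.B")]
m1_rhs = [("R.C", "S.D"), ("R.C", "S.E"), ("R.F", "S.G"), ("R.H", "S.G")]
m2_lhs = [("R.F", "S.E"), ("R.I", "S.E"), ("R.A", "S.E"), ("R.F", "S.B")]

es_r = equivalent_sets(m1_rhs, m2_lhs, "R")
es_s = equivalent_sets(m1_rhs, m2_lhs, "S")
all_bound = all(is_bound(es, m1_lhs) for es in es_r + es_s)
```

The sketch recovers one ES per relation, both of which intersect $LHS(m_1)$, in agreement with the example.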



Lemma 2. A linear pair $(m_1, m_2)$ of MDs as in Lemma 1
is easy if all ESs are bound. □

Example 11. (Examples 6 and 9 continued) If $\approx$ is
transitive, it follows from Lemma 2 that $M$ in Example 6 is
easy. As we verified in Example 9, $M$ does not satisfy the
conditions of Lemma 1. □

M of Example 6 does not satisfy the conditions of Lemma 
1, but satisfies those of Lemma 2. On the other hand, M of 
Example 7 satisfies the conditions of Lemma 1, but not those 
of Lemma 2. However, M of Example 10 satisfies both. This 
shows that the two easiness conditions are independent, but 
not mutually exclusive. Actually, Lemmas 1 and 2 combined 
give us the following result, which subsumes each of them. 

Theorem 1. Let $(m_1, m_2)$ be a linear pair as in Lemma 1.
For predicate $R$, let $E_R$ be the class of ESs of $(m_1, m_2)$ that
are equivalence classes of $TC(B_R)$. $E_S$ is defined similarly
using $B_S$.⁵ $(m_1, m_2)$ is easy if both of the following hold:

(a) At least one of the following is true: (i) there are no
attributes of $R$ in $RHS(m_1) \cap LHS(m_2)$; (ii) all ESs in $E_R$
are bound; or (iii) for each L-component $L$ of $m_1$, there is
an attribute of $R$ in $L \cap LHS(m_2)$.

(b) At least one of the following is true: (i) there are no
attributes of $S$ in $RHS(m_1) \cap LHS(m_2)$; (ii) all ESs in $E_S$
are bound; or (iii) for each L-component $L$ of $m_1$, there is
an attribute of $S$ in $L \cap LHS(m_2)$. □

In the rest of this section, we will obtain a partial converse of 
Theorem 1. For this purpose, we make the assumption that, 
for each similarity relation, there is an infinite set of mutu- 
ally dissimilar elements. Strictly speaking, the results below 
require only that the set of mutually dissimilar elements be 
at least as large as any instance under consideration. This is 
assumed in our next hardness result for certain linear pairs. 
We expect this assumption to be satisfied by many similar- 
ity measures used in practice, such as the edit distance and 
related similarities based on string comparison. 

The proof of the hardness result below is by polynomial reduction
from a decision problem that we call Cover Set (CS), which is
related to the well-known minimum set cover (MSC) problem.
Given $I = (U, \mathcal{C}, S)$, where $U$ is a set, $\mathcal{C}$ a collection
of subsets of $U$ whose union is $U$, and $S \in \mathcal{C}$, the problem
is deciding whether or not there is a minimum (cardinality) set
cover $\mathcal{S}'$ for $(U, \mathcal{C})$ with $S \in \mathcal{S}'$.
This problem is NP-complete.⁶ The reduction constructs a
finite database instance $D$, where every pair of values in it
that are different are also dissimilar. However, a value may
appear more than once. Certain values in $D$ are associated
with elements of $U$ or $\mathcal{C}$. This reduction is indifferent to
whether or not the similarity relations are transitive, since
distinct values in the instance are dissimilar, and equal values
are similar by equality subsumption.
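To make the Cover Set problem concrete, the following brute-force sketch decides small instances. It runs in exponential time and is for illustration only; it is not the reduction used in the proof. The function names and the sample instance are our own assumptions.

```python
from itertools import combinations
from typing import FrozenSet, List


def min_cover_size(universe: FrozenSet, collection: List[FrozenSet]) -> int:
    """Size of a minimum-cardinality sub-collection whose union is the universe."""
    for k in range(1, len(collection) + 1):
        for combo in combinations(collection, k):
            if frozenset().union(*combo) == universe:
                return k
    raise ValueError("collection does not cover the universe")


def in_some_min_cover(universe: FrozenSet, collection: List[FrozenSet],
                      s: FrozenSet) -> bool:
    """Cover Set: does some minimum set cover contain the given set s?"""
    k = min_cover_size(universe, collection)
    others = [c for c in collection if c != s]
    # s belongs to a minimum cover iff k - 1 other sets complete it to a cover.
    return any(s.union(*combo) == universe
               for combo in combinations(others, k - 1))


# Hypothetical instance: the minimum covers have two sets each.
U = frozenset({1, 2, 3, 4})
C = [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2, 3}),
     frozenset({4}), frozenset({2, 3})]

print(in_some_min_cover(U, C, frozenset({1, 2})))
print(in_some_min_cover(U, C, frozenset({2, 3})))
```

In this instance $\{1,2\}$ belongs to the minimum cover $\{\{1,2\},\{3,4\}\}$, whereas $\{2,3\}$ cannot be completed to a cover by a single additional set.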

Theorem 2. Assume each similarity relation has an infinite
set of mutually dissimilar elements. Let $(m_1, m_2)$ be
a linear pair of MDs with $RHS(m_1) \cap RHS(m_2) = \emptyset$. If
$(m_1, m_2)$ does not satisfy the condition of Theorem 1, then
it is hard.⁷ □

⁵ Thus, elements of $E_R$ are ESs in the sense of Definition
13(b), but for $TC(B_R)$ as opposed to $TC(B_R) \cup TC(B_S)$.
⁶ Cf. Lemma 4 in the appendix.

^The assumption RHS{mi) n RHS{m2) = is used to en- 



Example 12. We can apply Theorem 2 to identify hard
sets of MDs, assuming that each similarity relation involved
has an infinite set of mutually dissimilar elements.

The set of MDs in Example 5 is hard, because condition
(a) of Theorem 1 does not hold, since all of the following
hold: (i) there is an attribute, $R[B]$ of $R$, in $RHS(m_1) \cap
LHS(m_2)$; (ii) the ES $\{R[B]\}$ is not bound; and (iii) there
is no attribute of $R$ in the L-component $\{R[A], S[E]\}$ that
belongs to $LHS(m_2)$.

The set of MDs in Example 6 is hard, because condition
(b) of Theorem 1 does not hold, since all of the following
hold: (i) there is an attribute, $S[E]$ of $S$, in $RHS(m_1) \cap
LHS(m_2)$; (ii) the ES $\{S[E]\}$ is not bound; and (iii) there
is no attribute of $S$ in the L-component $\{\ldots, S[C]\}$ that
belongs to $LHS(m_2)$.

The set of MDs in Example 8 is hard, because condition
(a) of Theorem 1 does not hold, since all of the following
hold: (i) there are attributes of $R$ in $RHS(m_1) \cap LHS(m_2)$;
(ii) the ES $\{R[C], R[E], R[G]\}$ is not bound; and (iii) there
is no attribute of $R$ in the L-component $\{R[A], S[B]\}$ that
belongs to $LHS(m_2)$. □

Theorem 2 does not require the transitivity of the similarity
relations, which is needed for tractability. Theorems 1 and
2 imply the following dichotomy result. It tells us that for
a syntactic class of linear pairs, each of its elements is easy
or hard. That is, there is nothing "in between", which is not
necessarily true in general. Actually, if $P \neq NP$, there are
decision problems in NP between P and NP-complete [23].

Theorem 3. Assume each similarity relation is transitive
and has an infinite set of mutually dissimilar elements. Let
$(m_1, m_2)$ be a linear pair of MDs with $RHS(m_1) \cap RHS(m_2)
= \emptyset$. Then, $(m_1, m_2)$ is either easy or hard. □

Theorem 3 divides the class of linear pairs satisfying cer- 
tain conditions into an easy class, and a hard one. Deciding 
the membership of either of them requires a simple syntac- 
tic checking procedure. The dichotomy result shows that 
very simple pairs of MDs, even ones such as mi and m2 in 
Example 5, with equality as similarity, are hard. 

Given the high computational complexity of RAP for sets 
of two MDs, an important question is whether or not larger 
sets of interacting MDs can be easy. We provide a positive 
answer to this question in the next subsection. In the rest 
of the paper, we do not assume transitivity of similarity 
relations. 

3.2 Cyclic sets of MDs 

We described above how acyclic sets of MDs can be easy 
if the possible effects of accidental similarities are restricted. 
Here, we present a different class of easy sets of MDs for 
which such effects are not restricted. Actually, we establish 
the somewhat surprising result that certain cyclic sets of 
MDs are easy. In this section we do not make the assumption 
that each MD involves different predicates. 

Definition 15. A set $M$ of MDs is simple-cycle (SC) if its
MD graph $MDG(M)$ is (just) a cycle, and: (a) in all MDs
in $M$ and in all their corresponding pairs, the two attributes
(and predicates) are the same; and (b) in all MDs $m \in M$,
at most one attribute in $LHS(m)$ is changeable. □

sure that a resolved instance is always obtained after a fixed
number of updates (actually two), making it easier to restrict
the form MRIs can take. This is used in the hardness
proofs.

Example 13. For schema $R[A, C, F, G]$, consider the following
set $M$ of MDs:

$m_1$: $R[A] \approx R[A] \rightarrow R[C, F, G] \doteq R[C, F, G]$,

$m_2$: $R[C] \approx R[C] \rightarrow R[A, F, G] \doteq R[A, F, G]$.

$MDG(M)$ is a cycle, because the attributes in $RHS(m_2)$
appear in $LHS(m_1)$, and vice-versa. Furthermore, $M$ is SC,
because each of $LHS(m_1)$ and $LHS(m_2)$ is a singleton. □

For SC sets of MDs, it is easy to characterize the form taken 
by an MRI. 

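Example 14. Consider the instance $D$ and a SC set of
MDs, where the only similarities are: $a_i \approx a_j$, $b_i \approx b_j$,
$d_i \approx d_j$, $e_i \approx e_j$, with $i, j \in \{1, 2\}$.

$m_1$: $R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$,
$m_2$: $R[B] \approx R[B] \rightarrow R[A] \doteq R[A]$.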

If the MDs are applied twice, 
successively, starting from D, a 
possible result is:

It should be clear that, in any sequence of instances $D_1, D_2,
\ldots$, obtained from $D$ by applying the MDs, the updated
instances must have the following pairs of values equal (shown
through the tuple ids):



$D_i$, $i$ odd:
  attribute   tuple (id) pairs
  A           (1,4), (2,3)
  B           (1,2), (3,4)

$D_i$, $i$ even:
  attribute   tuple (id) pairs
  A           (1,2), (3,4)
  B           (1,4), (2,3)

Table 1: Table of matchings

In any stable instance, the pairs of values in the above tables 
must be equal. Given the alternating behavior, this can only 
be the case if all values in A are equal, and similarly for B, 
which can be achieved with a single update, choosing any 
value as the common value for each of A and B. In partic- 
ular, an MRI requires the common value for each attribute 
to be set to a most common value in the original instance. 
For D there are 16 MRIs. 

Set $M$ is easy: For any given instance $D$, a table like Table
1 can be constructed, and using it, the sets of duplicate values
(i.e. values that are different, but should be equal) in the
$R[A]$ and $R[B]$ columns can be matched in quadratic time.
Given those sets of duplicate values, and without having to
actually match them, the resolved answers to the (single-projected
atomic) queries $\exists y R(x, y)$ and $\exists x R(x, y)$ can be
obtained from those values that occur within a (possibly
singleton) set of duplicates more often than any other value.
For instance $D$, these queries return the empty set. □




Figure 1: The MD-graph of an HSC set of MDs 



Proposition 1. Simple-cycle sets of MDs are easy. □ 

The proof of this proposition can be done directly using an 
argument such as the one given for Example 14. However, 
this result will be subsumed by a similar one for a broader 
class of MDs (cf Definition 16). SC sets of MDs can be 
easily found in practical applications. 

Example 15. (Example 13 continued) The relation $R$, subject
to the given $M$, has two "keys", $R[A]$ and $R[C]$. A
relation like this may appear in a database about people:
$R[A]$ could be used for the person's name, $R[C]$ the address,
and $R[F]$ and $R[G]$ for non-distinguishing information, e.g.
gender and age. Easiness of $M$ can be shown as in Example
14, and also follows from Proposition 1. □

We show easiness for an extension of the class of SC MDs. 

Definition 16. A set $M$ of MDs with MD-graph $MDG(M)$
is hit-simple-cyclic (HSC) iff:

(a) $M$ satisfies conditions (a) and (b) in Definition 15; and

(b) each vertex $v_1$ in $MDG(M)$ is on at least one cycle or is
connected to a vertex $v_2$ on a cycle of non-zero length by an
edge directed toward $v_2$. □

Notice that SC sets are also HSC sets. An example of the 
MD graph of an HSC set of MDs is shown in Figure 1. 

As the previous examples suggest, it is possible to provide 
a full characterization of the MRIs for an instance subject 
to an HSC set of MDs, which we do next. It will be used to 
prove that HSC sets of MDs are easy (cf. Theorem 4). For 
this result, we need a few definitions and notations. 

For an SC set $M$ and $m \in M$, if a pair of tuples satisfies
the similarity condition of any MD in $M$, then the values
of the attributes in $RHS(m)$ must be merged for these tuples.
Thus, in Example 14, a pair of tuples satisfying either
$R[A] \approx R[A]$ or $R[B] \approx R[B]$ have both their $R[A]$ and $R[B]$
attributes updated to the same value. More generally, for
an HSC set $M$ of MDs, and $m \in M$, there is only a subset of
the MDs such that, if a pair of tuples satisfies the similarity
condition of an MD in the subset, then the values of the
attributes in $RHS(m)$ must be merged for the pair of tuples.
We now formally define this subset.

Definition 17. Let $M$ be a set of MDs, and $m \in M$. The
previous set of $m$, denoted $PS(m)$, is the set of all MDs
$m' \in M$ with a path in $MDG(M)$ from $m'$ to $m$. □

When applying a set of MDs to an instance, consistency
among updates made by different MDs must be enforced.
This generally requires computing a transitive closure relation
that involves both a pair of tuples and a pair of attributes.
For example, suppose $m_1$ has the conjunct $R[A] \doteq S[B]$
and $m_2$ has the conjunct $R[C] \doteq S[B]$. If $t_1$ and $t_2$
satisfy the condition of $m_1$, and $t_2$ and $t_3$ satisfy the
condition of $m_2$, then $t_1[A]$ and $t_3[C]$ must be updated to the
same value, since updating them to different values would
require $t_2[B]$ to be updated to two different values at once.
We formally define this relation.⁸

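Definition 18. Consider an instance $D$, and $M = \{m_1, m_2,
\ldots, m_n\}$, with

$m_i$: $R[\bar{A}_i] \approx_i S[\bar{B}_i] \rightarrow R[\bar{C}_i] \doteq S[\bar{E}_i]$.

(a) For $t_1, t_2 \in D$, $(t_1, C_i) \approx' (t_2, E_i) :\Leftrightarrow t_1[\bar{A}_j] \approx_j t_2[\bar{B}_j]$,
where $(C_i, E_i)$ is a corresponding pair of $(\bar{C}_i, \bar{E}_i)$ in $m_i$ and
$m_j \in PS(m_i)$. (b) The tuple-attribute closure of $M$ wrt $D$,
denoted $TA^{M,D}$, is the reflexive, transitive closure of $\approx'$. □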

Notice that $\approx'$ and $TA^{M,D}$ are binary relations on
tuple-attribute pairs. To keep the notation simple, we will omit
parentheses delimiting tuple/attribute pairs in elements of
$TA^{M,D}$ (simply written as $TA$). For example, for tuples
$t_1 = R(a, b, c)$ and $t_2 = S(d, e, f)$, with attributes $A, C$ for
$R, S$, resp., $TA((t_1, A), (t_2, C))$ is simply written as
$TA(t_1, A, t_2, C)$; and similarly, $TA(((a, b, c), A), ((d, e, f), C))$
as $TA(a, b, c, A, d, e, f, C)$.

In the case of NI and HSC sets of MDs, the MRIs for 
a given instance can be characterized simply using the tu- 
ple/attribute closure. This result is stated formally below. 

Proposition 2. For $M$ NI or HSC, and $D$ an instance, each
MRI for $D$ wrt $M$ is obtained by setting, for each equivalence
class $E$ of $TA^{M,D}$, the value of all $t[A]$ for $(t, A) \in E$ to one
of the most frequent values for $t[A]$ in $D$. □

Example 16. (Example 14 continued) In this example, we
represent tuples by their ids. We have

$TA^{M,D} = \{(i, A, j, A) \mid 1 \le i, j \le 4\} \cup \{(i, B, j, B) \mid 1 \le i, j \le 4\}$,

whose equivalence classes are $\{(i, A) \mid 1 \le i \le 4\}$ and
$\{(i, B) \mid 1 \le i \le 4\}$. From Proposition 2 and the requirement
of minimal change, the 16 MRIs are obtained by setting all
$R[A]$ and $R[B]$ attribute values to one of the four existing
(and, actually, equally frequent) values for them. □
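The characterization in Proposition 2 can be sketched with a union-find over value positions. The code below is an illustration under stated assumptions, not the authors' algorithm: the instance `D` is a hypothetical one chosen to be consistent with Examples 16 and 17 (four pairwise distinct values per column), the pairs of the relation $\approx'$ are supplied directly, and we assume, as in Example 16, that every choice of a most frequent value per class yields a distinct MRI.

```python
from collections import Counter, defaultdict
from math import prod


def closure_classes(positions, edges):
    """Equivalence classes of value positions (tuple id, attribute) under
    the reflexive, symmetric, transitive closure of the given pairs."""
    parent = {p: p for p in positions}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for p, q in edges:
        parent[find(p)] = find(q)
    classes = defaultdict(set)
    for p in positions:
        classes[find(p)].add(p)
    return list(classes.values())


def most_frequent(cls, instance):
    """Values occurring most often among the positions of one class."""
    counts = Counter(instance[t][a] for (t, a) in cls)
    top = max(counts.values())
    return {v for v, c in counts.items() if c == top}


def count_mris(classes, instance):
    """One most-frequent value is chosen per class (cf. Proposition 2)."""
    return prod(len(most_frequent(c, instance)) for c in classes)


# Hypothetical instance: tuple id -> attribute values.
D = {1: {"A": "a1", "B": "d1"}, 2: {"A": "a2", "B": "e1"},
     3: {"A": "b1", "B": "e2"}, 4: {"A": "b2", "B": "d2"}}
POSITIONS = [(t, a) for t in D for a in "AB"]
# Pairs (i, X) ~' (i mod 4 + 1, X), as stated in Example 17.
EDGES = [((i, a), (i % 4 + 1, a)) for a in "AB" for i in range(1, 5)]

CLASSES = closure_classes(POSITIONS, EDGES)
print(count_mris(CLASSES, D))
```

With two classes of four positions each and four equally frequent values per class, the count is $4 \times 4 = 16$, matching Example 16.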

Proposition 2 implies that for NI and HSC sets of MDs, 
the set E of sets of positions in an instance whose values are 
merged to produce an MRI is the same for all MRIs (but the 
common values chosen for them may differ, of course). This 
does not hold in general for arbitrary sets of MDs. Moreover, 
E can be computed by taking the transitive closure of a 
binary relation on values in the instance, an 0{n^) operation 
where n is the size of the instance. Given E, the resolved 
answers to the query Q^-^ are obtained as follows. For 
a tuple $t$ and attribute $A$, the value $v$, with $t[A] = v$, is a
resolved answer iff, for the equivalence class $S$ of $TA$ to which
$(t, A)$ belongs and for any $v' \neq v$,
$|\{(t', B) \in S \mid t'[B] = v\}| > |\{(t', B) \in S \mid t'[B] = v'\}|$.
These observations lead to the following result.

Theorem 4. HSC and NI sets of MDs are easy. □

⁸ This relation is actually more general than needed for HSC
sets of MDs, since each corresponding pair has the same
attributes. However, the more general case is needed when
discussing NI sets of MDs.



Theorem 4 does not imply that the set of all MRIs can be
efficiently computed. Because there can be $O(n)$ choices of
update value for each equivalence class of the tuple/attribute
closure, and $O(n)$ such equivalence classes, there can be
exponentially many MRIs.

It may seem counterintuitive that HSC sets are easy in
light of the fact that analogous non-cyclic cases such as the
linear pair $(m_1, m_2)$ of Example 5 are hard. Indeed, while
tractability occurs in non-cyclic cases when accidental similarities
are "filtered out" and cannot affect the duplicate
resolution process, cyclic cases are easy for the opposite rea- 
son: all possible accidental similarities are imposed on the 
values as these similarities are propagated to all attributes 
in the MDs on the cycle. Thus, the intractability arising 
from having to choose common values so as to avoid certain 
accidental similarities is removed. 

The tuple/attribute closure of Definition 18 can be de- 
fined using a Datalog program, which we can use for query 
rewriting (cf. Section 4). Let M be as in Definition 18. 
Without losing generality and to simplify the presentation, 
we will assume in the rest of this section that predicates R 
and S are the same, so that we can keep them implicit. 

The facts of the Datalog program, $\Pi^{M,D}$, are the ground
atoms $R(a)$ in the original instance $D$, plus the facts of the
form $c \approx_i d$, that capture the similarity, in the sense of $\approx_i$,
of a pair of tuples $c$ and $d$ occurring in $D$. Furthermore,
$\Pi^{M,D}$ contains, for each $m_i \in M$, for each corresponding pair
$R[A] \doteq R[B]$ in $m_i$, and for each $m_j \in PS(m_i)$, the rule

$(x, A) \approx' (y, B) \leftarrow R(x), R(y), x \approx_j y$.

The tuple/attribute closure $TA^{M,D}$ is given in Datalog as

$TA(x, A, y, B) \leftarrow (x, A) \approx' (y, B)$.

$TA(x, A, z, C) \leftarrow TA(x, A, y, B), (y, B) \approx' (z, C)$.

It is easy to verify that this program is finite and positive,
and that all its rules are safe, in the sense that all variables
appear in positive body atoms. The single minimal
model of the program can be computed bottom-up, as usual.
This model captures the sets of value positions to be merged
which, as pointed out previously, are the same for all MRIs
of an instance to which a NI or HSC set of MDs applies.

Example 17. (Examples 14 and 16 continued) For the MDs
and instance of Example 14, the facts of $\Pi^{M,D}$ are $1 \approx_1 2$,
$3 \approx_1 4$, $1 \approx_2 4$, and $2 \approx_2 3$, where $\approx_i$ denotes the similarity
condition of $m_i$, in addition to the ground atoms in $D$.
Applying $\Pi^{M,D}$ gives $(i, A) \approx' (i \bmod 4 + 1, A)$ and
$(i, B) \approx' (i \bmod 4 + 1, B)$, $1 \le i \le 4$. Applying the rules for $TA$ we
reobtain the classes in Example 16. □
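The bottom-up computation can be imitated by a naive fixpoint iteration. The sketch below is written in Python rather than Datalog and rests on assumptions of ours: tuples are represented by their ids, similarity facts are stored symmetrically (the text lists each similar pair once), and each rule is encoded as a triple `(A, B, j)` standing for a corresponding pair $R[A] \doteq R[B]$ of some $m_i$ together with an MD $m_j \in PS(m_i)$.

```python
def sym(pairs):
    """Symmetric version of a set of similarity facts (an assumption)."""
    return set(pairs) | {(y, x) for (x, y) in pairs}


def tuple_attribute_closure(sim_facts, rules):
    """Naive bottom-up evaluation of the two TA rules.

    sim_facts: dict j -> set of (x, y) with x similar to y under m_j.
    rules: triples (A, B, j) generating (x, A) ~' (y, B) from x ~_j y.
    """
    base = {((x, a), (y, b))
            for (a, b, j) in rules for (x, y) in sim_facts[j]}
    ta = set(base)                       # TA(x,A,y,B) <- (x,A) ~' (y,B)
    while True:
        # TA(x,A,z,C) <- TA(x,A,y,B), (y,B) ~' (z,C)
        new = {(p, r) for (p, q) in ta for (q2, r) in base if q == q2} - ta
        if not new:
            return ta
        ta |= new


# Facts of Example 17 (tuple ids), made symmetric.
SIM = {1: sym({(1, 2), (3, 4)}), 2: sym({(1, 4), (2, 3)})}
# For the SC set of Example 14, PS(m1) = PS(m2) = {m1, m2}.
RULES = [("B", "B", 1), ("B", "B", 2), ("A", "A", 1), ("A", "A", 2)]

TA = tuple_attribute_closure(SIM, RULES)
print(len(TA))
```

The fixpoint contains exactly the pairs $(i, X, j, X)$ with $X \in \{A, B\}$ and $1 \le i, j \le 4$, which is the relation given in Example 16.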

This suggests a declarative specification of the resolved an- 
swers: Given a conjunctive query, the query is rewritten by 
incorporating the Datalog rules above. The combination re- 
trieves the resolved answers to the original query. In the 
next section, we will develop this approach for both NI and 
HSC sets of MDs, to rewrite a query into one that retrieves 
the resolved answers to the original query. We will be able 
to provide both a query rewriting methodology, and also an 
extension of the tractability results of this section (that re- 
fer to single-projected atomic queries) to a wider class of 
conjunctive queries. 

In this section we presented an algorithm that, taking 
as input an instance D and an HSC set of MDs, identifies 
the sets of duplicates (i.e. sets of values that have to be 



matched) in time 0{n^), with n = \D\. This entails the 
easiness of such sets of MDs (cf. Theorem 4). We also intro- 
duced a Datalog program that can be used to identify the 
duplicate sets, as an alternative to updating the instance. 
The algorithm for duplicate set identification can be eas- 
ily extended into one that computes the set of all MRIs for 
a given instance D. As expected, the combination of the 
choices of common values may lead to an exponential num- 
ber of MRIs for D. 

4. RESOLVED QUERY ANSWERING 

Here, we consider the two classes of easy sets of MDs: NI 
and HSC sets of MDs. We will take advantage of the results 
of Section 3.2, to efficiently retrieve the resolved answers to 
queries in the UJCQ class of conjunctive queries (cf. Defini- 
tion 19). It extends the single-projected atomic queries (3), 
which have a tractable RAP, by Theorem 4. 

More precisely, we identify and discuss tractable cases of
$RA_{Q,M}$ for HSC and NI sets of MDs, and a certain class of
conjunctive queries $Q$. Actually, we present a query rewriting
technique for obtaining their resolved answers. It works
as follows. Given an instance $D$ and a query $Q$, the MRIs
for $D$ are not explicitly computed. Instead, $Q$ is rewritten
into a new query $Q'$, using both $Q$ and $M$. Query $Q'$ is such
that when posed to $D$ (as usual), it returns the resolved
answers to $Q$ from $D$. $Q'$ may not be a conjunctive query
anymore. However, if it can be efficiently evaluated against
$D$, the resolved answers can also be efficiently computed.⁹
In our case, the rewritten queries will be (positive) Datalog
queries with aggregation (actually, Count). They can be
evaluated in polynomial time, making $RA_{Q,M}$ tractable.

The queries $Q$ will be conjunctive, without built-in atoms,
i.e. of the form $Q(\bar{x}): \exists \bar{u}(R_1(\bar{v}_1) \wedge \cdots \wedge R_n(\bar{v}_n))$, with
$R_i \in \mathcal{R}$, and $\bar{x} = (\bigcup_i \bar{v}_i) \setminus \bar{u}$. Some additional restrictions
on the joins will be imposed below, to guarantee the
tractability of $RA_{Q,M}$.

Definition 19. Let Q be a conjunctive query, and M a set 
of MDs. Query Q is an unchangeable join conjunctive query 
if there are no existentially quantified variables in a join in 
Q in the position of a changeable attribute. UJCQ denotes 
this class of queries. □ 

Example 18. For schema $\mathcal{S} = \{R[A, B]\}$, let $M$ consist
of the single MD $R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$. Attribute
$B$ is changeable, and $A$ is unchangeable. The query
$Q_1(x, z): \exists y(R(x, y) \wedge R(z, y))$ is not in UJCQ, because the
bound and repeated variable $y$ is for the changeable attribute
$B$. However, the query $Q_2(y): \exists x \exists z(R(x, y) \wedge R(x, z))$ is
in UJCQ: the only bound, repeated variable is $x$, which is
for the unchangeable attribute $A$. If variables $x$ and $y$ are
swapped in the first atom of $Q_2$, the query is not UJCQ. □
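Membership in UJCQ is a purely syntactic test. The following sketch reads Definition 19 as: no existentially quantified variable that occurs more than once in the query may appear in the position of a changeable attribute. The query encoding (atoms as pairs of a relation name and a list of variables, changeable attributes as position indexes) and the name `is_ujcq` are our own assumptions; constants in atoms are not handled.

```python
from collections import Counter


def is_ujcq(free_vars, atoms, changeable):
    """Check Definition 19 on a conjunctive query.

    free_vars: set of free variable names.
    atoms: list of (relation, [variables]) pairs.
    changeable: dict relation -> set of 0-based changeable positions.
    """
    occurrences = Counter(v for _, args in atoms for v in args)
    for rel, args in atoms:
        for pos, v in enumerate(args):
            bound_join = v not in free_vars and occurrences[v] > 1
            if bound_join and pos in changeable.get(rel, set()):
                return False
    return True


# Example 18: attribute B (position 1) of R is changeable.
CHANGEABLE = {"R": {1}}
Q1 = ({"x", "z"}, [("R", ["x", "y"]), ("R", ["z", "y"])])
Q2 = ({"y"}, [("R", ["x", "y"]), ("R", ["x", "z"])])
Q2_SWAPPED = ({"y"}, [("R", ["y", "x"]), ("R", ["x", "z"])])

print(is_ujcq(*Q1, CHANGEABLE), is_ujcq(*Q2, CHANGEABLE))
```

On the three queries of Example 18 the test rejects $Q_1$, accepts $Q_2$, and rejects the variant of $Q_2$ with $x$ and $y$ swapped in the first atom.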

We will use the $Count(R)$ operator in queries [1]. It returns
the number of tuples in a relation $R$, and will be applied
to sets of tuples of the form $\{\bar{x} \mid C\}$, where $\bar{x}$ is a tuple
of variables, and $C$ is a condition involving a set of free
variables that include those in $\bar{x}$. More precisely, for an
instance $D$, $Count(\{\bar{x} \mid C\})$ takes on $D$ the numerical value
$|\{\bar{c} \mid D \models C[\bar{c}]\}|$. The variables in $C$ that do not appear in
$\bar{x}$ are intended to be existentially quantified. A condition $C$
can be seen as a predicate defined by means of a Datalog
query with the $\neq$ built-in. For motivation and illustration,
we now present a simple example of rewriting using Count.
Throughout the rest of this section, we use the notation of
Example 16 for the arguments of $TA$.

⁹ FO query rewriting was applied in CQA, already in [3] (cf.
[8] for a survey).

Example 19. Consider $R[A, B, C]$, $m$: $R[A] \approx R[A] \rightarrow
R[B] \doteq R[B]$, and the UJCQ query $Q(x, y, z): R(x, y, z)$.
These are the extensions for $R$ and its (single) MRI:

  R     A     B     C          MRI   A     B     C
        a_1   b_1   c_1              a_1   b_2   c_1
        a_1   b_2   c_2              a_1   b_2   c_2
        a_1   b_2   c_3              a_1   b_2   c_3
The set of resolved answers to $Q$ is $\{(a_1, b_2, c_1), (a_1, b_2, c_2),
(a_1, b_2, c_3)\}$. The following query, directly posed to the
(actually, any) initial instance, returns the resolved answers. In
it, $TA$ stands for the tuple-attribute closure of $\{m\}$ wrt the instance.

$Q'(x, y, z): \exists y'(R(x, y', z) \wedge \forall y''[$
  $Count\{(x', y, z') \mid TA(x, y', z, B, x', y, z', B) \wedge R(x', y, z')\} >$
  $Count\{(x', y'', z') \mid TA(x, y', z, B, x', y'', z', B) \wedge R(x', y'', z') \wedge y'' \neq y\}])$   (5)

As we saw in Section 3.2, the $TA$ here can be specified by
means of a Datalog query. Actually, the whole query can
be easily expressed by means of a single Datalog query with
aggregation¹⁰ and comparison as a built-in.

Intuitively, the first conjunct requires the existence of a
tuple $t$ with the same values as the answer for attributes $A$
and $C$. Since the values of these attributes are not changed
when going from the original instance to an MRI, such a
tuple must exist. However, the tuple is not required to have
the same $B$ attribute value as the answer tuple, because
this attribute can be modified. For example, $(a_1, b_2, c_1)$ is a
resolved answer, but is not in $R$. What makes it a resolved
answer is the fact that it is in an equivalence class of value
positions (consisting of all three positions in the $B$ column of
the instance) for which $b_2$ occurs more frequently than any
other value. This counting condition on resolved answers is
expressed by the second conjunct. Attribute $B$ is the only
changeable attribute, so it is the only attribute argument to
$TA$, which specifies the values to be merged. Query (5) can
be computed in polynomial time on any instance. □
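The counting condition can be checked directly on the instance of Example 19. The sketch below is a simplified illustration, not the rewritten query itself: it assumes equality as the similarity on $A$, so that the class of a $B$ position consists of the $B$ positions of all tuples with the same $A$ value, and it hard-codes the extension of $R$ as we read it from the example.

```python
from collections import Counter

# Extension of R in Example 19, as read from the text (an assumption).
R = [("a1", "b1", "c1"), ("a1", "b2", "c2"), ("a1", "b2", "c3")]


def resolved_answers(r):
    """Resolved answers to Q(x, y, z): R(x, y, z) under the MD
    R[A] = R[A] -> R[B] matched, with equality as similarity on A."""
    answers = set()
    for (x, _, z) in r:
        # B values in the class of this tuple's B position.
        counts = Counter(y for (x2, y, _) in r if x2 == x)
        for y, n in counts.items():
            # y must be strictly more frequent than every other value.
            if all(n > m for y2, m in counts.items() if y2 != y):
                answers.add((x, y, z))
    return answers


print(sorted(resolved_answers(R)))
```

Since $b_2$ occurs twice and $b_1$ once among the three $B$ positions, every tuple contributes an answer with $b_2$ in the second position, in agreement with the set of resolved answers stated in the example.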

The Rewrite algorithm in Table 2 uses a binary relation on 
attributes, that we now introduce. 

Definition 20. Let $M$ be a set of MDs. (a) The symmetric
binary relation $\doteq_r$ is defined on attributes, as follows:
$R[A] \doteq_r S[B]$ iff there is $m \in M$ with $R[A] \doteq S[B]$ appearing
on the RHS of $m$'s arrow.

(b) $E_{R[A]}$ denotes the equivalence class of the reflexive,
transitive closure of $\doteq_r$ that contains $R[A]$. □

Example 20. Let $M$ be the set of MDs

$R[A] \approx_1 S[B] \rightarrow R[C] \doteq S[D]$,

$S[E] \approx_2 T[F] \wedge S[G] \approx T[H] \rightarrow S[D, K] \doteq T[J, L]$,

$T[F] \approx_3 T[H] \rightarrow T[L, N] \doteq T[M, P]$.

The equivalence classes of the closure are $E_{R[C]} = \{R[C], S[D], T[J]\}$,
$E_{S[K]} = \{S[K], T[L], T[M]\}$, and $E_{T[N]} = \{T[N], T[P]\}$. □
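The classes of Definition 20 can be computed by merging the attribute pairs that appear on the right-hand sides of the MDs. A minimal union-find sketch follows; the string encoding of attributes and the function name are assumptions of ours, and the input pairs are the corresponding RHS pairs of Example 20.

```python
def attribute_classes(rhs_pairs):
    """Equivalence classes of attributes linked by RHS matchings."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in rhs_pairs:
        parent[find(a)] = find(b)
    classes = {}
    for a in list(parent):
        classes.setdefault(find(a), set()).add(a)
    return {frozenset(c) for c in classes.values()}


# Corresponding RHS pairs of the three MDs in Example 20.
RHS_PAIRS = [("R[C]", "S[D]"),
             ("S[D]", "T[J]"), ("S[K]", "T[L]"),
             ("T[L]", "T[M]"), ("T[N]", "T[P]")]

for cls in sorted(attribute_classes(RHS_PAIRS), key=sorted):
    print(sorted(cls))
```

The three classes obtained coincide with $E_{R[C]}$, $E_{S[K]}$ and $E_{T[N]}$ of Example 20.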

¹⁰ Count queries with group-by in Datalog can be expressed
by rules of the form $Q(\bar{x}, count(z)) \leftarrow B(\bar{x}')$, where $\bar{x} \cup \{z\} \subseteq
\bar{x}'$, $z \notin \bar{x}$, and $B$ is a conjunction of atoms.



To emphasize the association between a variable and a particular
attribute, we sometimes subscript the variable name
with the name of the attribute. For example, given a relation
$R$ with attributes $A$ and $B$ and atom $R(x, y)$, we sometimes
write $x$ as $x_A$. To express substitutions of variables within
lists of variables, we give the name of the variable list, followed
by the substitution in square brackets. For example,
the list of variables obtained from the list $\bar{v}$ by substitution
of variables from a subset $S$ of the variables in $\bar{v}$ with primed
variables is expressed as $\bar{v}[v \rightarrow v' \mid v \in S]$.



Table 2: Rewrite Algorithm 

Rewrite outputs a rewritten query $Q'$ for an input consisting
of a query $Q \in$ UJCQ and a set of NI or HSC MDs. It rewrites
the query by separately rewriting each conjunct $R_i(\bar{v}_i)$ in
$Q$. If $R_i(\bar{v}_i)$ contains no free variables in the position of a
changeable attribute, then it is unchanged
(line 6). Otherwise, it is replaced with a conjunction involving
the same atom and additional conjuncts which use the
Count operator. The conjuncts involving Count express the
condition that, for each changeable attribute value returned
by the query, this value is more numerous than any other
value in the same set of values that is equated by the MDs.
The Count expressions contain new local variables as well as
a new universally quantified variable $v''_{iA}$.

Example 21. We illustrate the algorithm with predicates
$R[ABC]$, $S[EFG]$, $U[HI]$; the UJCQ query

$Q(x, y, z): \exists t\, u\, p\, q\, (R(x, y, z) \wedge S(t, u, z) \wedge U(p, q))$;

and the NI MDs: $R[A] \approx S[E] \rightarrow R[B] \doteq S[F]$, and
$S[E] \approx U[H] \rightarrow S[F] \doteq U[I]$.

Since the $S$ and $U$ atoms have no free variables holding
the values of changeable attributes, these conjuncts remain
unchanged (line 6). The only free variable holding the value



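Input: A query in UJCQ and a NI or HSC set of MDs
       $M = \{m_1, \ldots, m_p\}$.

Output: The rewritten query $Q'$.

1) Let $Q(\bar{x}): \exists \bar{u} \bigwedge_{1 \le i \le n} R_i(\bar{v}_i)$ be the query.
2) Let $TA$ denote $TA^{M,D}$
3) For each $R_i(\bar{v}_i)$
4)    Let $C$ be the set of changeable attributes of $R_i$
      corresponding to a free variable in $\bar{v}_i$
5)    If $C$ is empty
6)       $Q_i(\bar{v}_i) \leftarrow R_i(\bar{v}_i)$
7)    Else
8)       $\bar{v}'_i \leftarrow \bar{v}_i[v_{iA} \rightarrow v'_{iA} \mid A \in C]$
9)       Let $\bar{v}_{iC}$ be the list of variables $v_{iA}$, $A \in C$
10)      $\bar{v}'_{iC} \leftarrow \bar{v}_{iC}[v_{iA} \rightarrow v'_{iA} \mid A \in C]$
11)      For each variable $v_{iA}$ in $\bar{v}_{iC}$
12)         For each attribute $R_j[B_k] \in E_{R_i[A]}$
13)            Generate atom $R_j(\bar{u}'_{jk})$, with
               $\bar{u}'_{jk}$ a list of new variables
14)            $\bar{u}_{jk} \leftarrow \bar{u}'_{jk}[u'_{jk R_j[B_k]} \rightarrow v_{iA}]$
15)            $\bar{w}_{jk} \leftarrow \bar{u}'_{jk}[u'_{jk R_j[B_k]} \rightarrow v''_{iA}]$
16)            $C^1_{jk} \leftarrow Count\{\bar{u}_{jk} \mid TA(\bar{v}'_i, R_i[A], \bar{u}_{jk}, R_j[B_k]) \wedge R_j(\bar{u}_{jk})\}$
17)            $C^2_{jk} \leftarrow Count\{\bar{w}_{jk} \mid TA(\bar{v}'_i, R_i[A], \bar{w}_{jk}, R_j[B_k]) \wedge R_j(\bar{w}_{jk}) \wedge v''_{iA} \neq v_{iA}\}$
18)      $Q_i(\bar{v}_i) \leftarrow \exists \bar{v}'_{iC}(R_i(\bar{v}'_i) \wedge \bigwedge_{A \in C} \forall v''_{iA}[\sum_{j,k} C^1_{jk} > \sum_{j,k} C^2_{jk}])$
19) $Q'(\bar{x}) \leftarrow \exists \bar{u} \bigwedge_{1 \le i \le n} Q_i(\bar{v}_i)$
20) return $Q'$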



of a changeable attribute is $y$. Therefore, line 8 sets $\bar{v}'_1$ to
$(x, y', z)$. Variable $y$ contains the value of attribute $R[B]$.
The equivalence class $E_{R[B]}$ is $\{R[B], S[F], U[I]\}$, so the
loop at line 12 generates the atoms $R(x', y, z')$, $R(x', y'', z')$,
$S(t', y, z')$, $S(t', y'', z')$, $U(p', y)$, $U(p', y'')$. The rewritten
query is obtained by replacing in $Q$ the conjunct $R(x, y, z)$
by

$\exists y'(R(x, y', z) \wedge \forall y''[$
  $Count\{(x', y, z') \mid TA(x, y', z, R[B], x', y, z', R[B]) \wedge R(x', y, z')\}$
  $+\ Count\{(t', y, z') \mid TA(x, y', z, R[B], t', y, z', S[F]) \wedge S(t', y, z')\}$
  $+\ Count\{(p', y) \mid TA(x, y', z, R[B], p', y, U[I]) \wedge U(p', y)\}$
  $>\ Count\{(x', y'', z') \mid TA(x, y', z, R[B], x', y'', z', R[B]) \wedge R(x', y'', z') \wedge y'' \neq y\}$
  $+\ Count\{(t', y'', z') \mid TA(x, y', z, R[B], t', y'', z', S[F]) \wedge S(t', y'', z') \wedge y'' \neq y\}$
  $+\ Count\{(p', y'') \mid TA(x, y', z, R[B], p', y'', U[I]) \wedge U(p', y'') \wedge y'' \neq y\}])$. □

Notice that the resulting query in Example 21 can be easily
translated into a Datalog query with the aggregate Count plus
the built-ins $\neq$, $>$ and $+$, the last two applied to natural
numbers resulting from counting; this is a general fact about
the algorithm. The FO part can be transformed
by means of the Lloyd-Topor transformation [25].

Theorem 5. For a NI or HSC set of MDs M and a UJCQ 
query Q, the query Q' computed by the Rewrite algorithm 
is efficiently evaluable and returns the resolved answers to 
Q. □ 

The rewriting algorithm does not depend on the dirty in- 
stance at hand, but only on the MDs and the input query, 
and runs in polynomial time in the size of Q and M . 

In the next section, we will relate $RA_{Q,M}$ to consistent
query answering (CQA) [7, 8]. This connection and some
known results in CQA will allow us to identify further tractable
cases, but also to establish the intractability of $RA_{Q,M}$ for
certain classes of queries and MDs. The latter result implies
that the tractability results in this section cannot be
extended to all conjunctive queries.

5. A CQA CONNECTION 

MDs can be seen as a new form of integrity constraint 
(IC), with a dynamic semantics. An instance D violates an 
MD m if there are unresolved duplicates, i.e. tuples ti and 
t2 in D that satisfy the similarity conditions of m, but differ 
in value on some pairs of attributes that are expected to be 
matched according to m. The instances that are consistent 
with a set of MDs M (or self- consistent from the point of 
view of the dynamic semantics) are resolved instances of 
themselves with respect to M. Among classical ICs, the 
closest analogues of MDs are functional dependencies (FDs). 

Now, given a database instance $D$ and a set of ICs $\Sigma$, possibly
not satisfied by $D$, consistent query answering (CQA)
is the problem of characterizing and computing the answers
to queries $Q$ that are true in all repairs of $D$, i.e. the
instances $D'$ that are consistent with $\Sigma$ and minimally differ
from $D$ [3]. Minimal difference between instances can be
defined in different ways. Most of the research in CQA has
concentrated on the case of the set-theoretic symmetric difference
of instances, as sets of tuples, which in the case of
repairs is made minimal under set inclusion, as originally
introduced in [3]. Also the minimization of the cardinality of
this set-difference has been investigated [26, 2]. Other forms
of minimization measure the differences in terms of changes
of attribute values between $D$ and $D'$ (as opposed to entire
tuples) [19, 27, 18, 9], e.g. the number of attribute updates
can be used for comparison. Cf. [7, 12, 8] for CQA.

Because of their practical importance, much work on CQA
has been done for the case where $\Sigma$ is a set of functional
dependencies (FDs), and in particular for sets, $\mathcal{K}$, of key
constraints (KCs) [13, 20, 29, 28, 30], with the distance being
the set-theoretic symmetric difference under set inclusion. In
this case, on which we concentrate in the rest of this section,
a repair $D'$ of an instance $D$ becomes a maximal subset of
$D$ that satisfies $\mathcal{K}$, i.e. $D' \subseteq D$, $D' \models \mathcal{K}$, and there is no
$D''$ with $D' \subsetneq D'' \subseteq D$, with $D'' \models \mathcal{K}$ [13].

Accordingly, for a FO query $Q(\bar{x})$ and a set of KCs $\mathcal{K}$, $\bar{a}$ is
a consistent answer from $D$ to $Q(\bar{x})$ wrt $\mathcal{K}$ when $D' \models Q[\bar{a}]$,
for every repair $D'$ of $D$. For fixed $Q(\bar{x})$ and $\mathcal{K}$, the consistent
query answering problem is about deciding membership
in the set $CQA_{Q,\mathcal{K}} = \{(D, \bar{a}) \mid \bar{a}$ is a consistent answer from
$D$ to $Q$ wrt $\mathcal{K}\}$.
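For a single key constraint, the repairs just described can be enumerated by keeping exactly one tuple per key value, and the consistent answers are those returned in every repair. The sketch below is a brute-force illustration with hypothetical names and data; it is exponential in the number of conflicting key groups and is not one of the rewriting techniques cited in this section.

```python
from collections import defaultdict
from itertools import product


def repairs(instance, key_len):
    """Maximal subsets satisfying the key: one tuple per key value.
    The key is assumed to be the first key_len attributes."""
    groups = defaultdict(list)
    for t in sorted(set(instance)):
        groups[t[:key_len]].append(t)
    for choice in product(*groups.values()):
        yield set(choice)


def consistent_answers(instance, key_len, query):
    """Answers to query (a function from an instance to a set)
    that hold in every repair."""
    result = None
    for rep in repairs(instance, key_len):
        ans = query(rep)
        result = ans if result is None else result & ans
    return result if result is not None else set()


# Hypothetical instance violating the key on the first attribute.
D = {("k1", "v1"), ("k1", "v2"), ("k2", "v3")}

print(len(list(repairs(D, 1))))
print(consistent_answers(D, 1, lambda rep: set(rep)))
```

Here the two repairs disagree on the tuple with key `k1`, so only `("k2", "v3")` is a consistent answer to the identity query, while the projection on the key returns both key values in every repair.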

Notice that this notion of minimality involved in repairs
wrt FDs is tuple and set-inclusion oriented, whereas the one
that is implicitly related to MDs and MRIs via the matchings
(cf. Definition 7) is attribute and cardinality oriented.¹¹
However, the connection can still be established. In particular,
the following result can be obtained through a reduction
and a result in [13, Thm. 3.3].

Theorem 6. Consider the relational predicate $R[A, B, C]$,
the MD $m$: $R[A] = R[A] \rightarrow R[B, C] \doteq R[B, C]$, and the
non-UJCQ query $Q: \exists x \exists y \exists y' \exists z(R(x, y, c) \wedge R(z, y', d) \wedge y =
y')$. $RA_{Q,\{m\}}$ is coNP-complete.¹² □

For certain classes of conjunctive queries and ICs consisting
of a single KC per relation, CQA is tractable. This is
the case for the $C_{forest}$ class of conjunctive queries [20], for
which there is a FO rewriting methodology for computing
the consistent answers. $C_{forest}$ excludes repeated relations
(self-joins), and allows joins only between non-key and key
attributes. Similar results were subsequently proved for a
larger class of queries that includes some queries with repeated
relations and joins between non-key attributes [29,
28, 30]. The following result allows us to take advantage of
tractability results for CQA in our MD setting.

Proposition 3. Let $D$ be a database instance for a single
predicate $R$ whose set of attributes is $A \cup B$, with $A \cap B = \emptyset$;
and $m$ the MD $R[A] = R[A] \rightarrow R[B] \doteq R[B]$. There is
a polynomial time reduction from $RA_{Q,\{m\}}$ to $CQA_{Q,\{\kappa\}}$,
where $\kappa$ is the key constraint $A \rightarrow B$. □

Proposition 3 can be easily generalized to several relations
with one such MD defined on each. The reduction takes an
instance $D$ for $RA_{Q,\{m\}}$ and produces an instance $D'$ for
$CQA_{Q,\{\kappa\}}$. The schema of $D'$ is the same as for $D$, but
the extension of the relation is changed wrt $D$ via counting.
Definitions for those aggregations can be inserted into query
$Q$, producing a rewriting $Q'$. Thus, we obtain:

Theorem 7. Let $\mathcal{S}$ be a schema with $\mathcal{R} = \{R_1[A_1, B_1], \ldots,
R_n[A_n, B_n]\}$ and $\mathcal{K}$ the set of KCs $\kappa_i$: $R_i[A_i] \rightarrow R_i[B_i]$.
Let $Q$ be a FO query for which there is a polynomial-time
computable FO rewriting $Q'$ for computing the consistent
answers to $Q$. Then there is a polynomial-time computable
FO query $Q''$ extended with aggregation for computing
the resolved answers to $Q$ from $D$ wrt the set of MDs $m_i$:
$R_i[A_i] = R_i[A_i] \rightarrow R_i[B_i] \doteq R_i[B_i]$. □

Footnote (on the two notions of minimality, before Theorem 6): Cf. [21]
for a discussion of the differences between FDs and MDs seen as ICs,
and their repair processes.

Footnote (on Theorem 6): This result appeals to many-one (Karp)
reductions, in contrast to the Turing reductions used in Section 3.

The aggregation in Q" in Theorem 7 arises from the generic 
transformation of the instance that is used in the reduction 
involved in Proposition 3, but here becomes implicit in the 
query. 

We emphasize that $Q''$ is not obtained using algorithm
Rewrite from Section 4, which is not guaranteed to work
for queries outside the class UJCQ. Rather, a first-order
transformation of the $R_i$ relations with Count is composed
with $Q'$ to produce $Q''$. As in the Rewrite algorithm of
Section 4, Count is used to capture the most frequently occurring
values for the changeable attributes for a given set of tuples
with identical values for the unchangeable attributes.
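
The role of counting described above can be sketched as follows. This is only an illustration under simplifying assumptions, not the paper's reduction: the relation has one unchangeable attribute A and one changeable attribute B, it is stored as a list of (a, b) pairs in which repeated pairs stand for distinct tuples, and the function name `most_frequent_instance` is hypothetical. For each A-value the sketch keeps only the B-values that occur most often in that group, which are the candidate common values under a minimal number of attribute changes.

```python
from collections import Counter, defaultdict

def most_frequent_instance(instance):
    """Keep, for each A-value, the tuples whose B-value is most frequent.

    `instance` is a list of (a, b) pairs; duplicates count as separate
    tuples when frequencies are computed.
    """
    counts = defaultdict(Counter)
    for a, b in instance:
        counts[a][b] += 1
    reduced = set()
    for a, counter in counts.items():
        top = max(counter.values())
        # Ties are kept: each tied value is a candidate common value.
        reduced.update((a, b) for b, n in counter.items() if n == top)
    return reduced

# Hypothetical instance: for A = 1 the value "x" wins, for A = 2 there is a tie.
D = [(1, "x"), (1, "x"), (1, "y"), (2, "u"), (2, "w")]
print(sorted(most_frequent_instance(D)))
```

In the transformed instance, a group with a unique most frequent value satisfies the key constraint from A to B, while a tie leaves a key violation whose repairs correspond to the alternative choices of common value.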

This theorem can be applied to decide/compute resolved 
answers in those cases where a FO rewriting for CQA has 
been identified. In consequence, it extends the tractable 
cases identified in Section 4. It can be applied to queries 
that are not in UJCQ. 

Example 22. The query $Q$: $\exists x \exists y \exists z (R(x,y) \wedge S(y,z))$
is in the class $\mathcal{C}_{forest}$ for relational predicates $R[A, B]$ and
$S[C, E]$ and KCs $A \rightarrow B$ and $C \rightarrow E$. By Theorem 7 and
the results in [20], there is a polynomial-time computable
FO query with counting that returns the resolved answers
to $Q$ wrt the MDs $R[A] = R[A] \rightarrow R[B] \doteq R[B]$ and
$S[C] = S[C] \rightarrow S[E] \doteq S[E]$. Notice that $Q$ is not in UJCQ,
since the bound variable $y$ is associated with the changeable
attribute $R[B]$. □

6. CONCLUSIONS 

Matching dependencies specify both a set of integrity constraints
that need to be satisfied for a database to be free
of unresolved duplicates, and, implicitly, also a procedure for
resolving such duplicates. Minimally resolved instances [21]
define the end result of this duplicate resolution process. In 
this paper we considered the problem of computing the an- 
swers to a query that persist across all MRIs (the resolved 
answers). In particular, we studied query rewriting meth- 
ods for obtaining these answers from the original instance 
containing unresolved duplicates. 

Depending on syntactic criteria on MDs and queries, trac- 
table and intractable cases of resolved query answering were 
identified. We discovered the first dichotomy result in this 
area. In some of the tractable cases, the original query can 
be rewritten into a new, polynomial-time evaluable query 
that returns the resolved answers when posed to the origi- 
nal instance. It is interesting that the rewritings make use 
of counting and recursion (for the transitive closure). The 
original queries considered in this paper are all conjunctive. 
Other classes of queries will be considered in future work. 

We established interesting connections between resolved 
query answering wrt MDs and consistent query answering. 
There are still many issues to explore in this direction, e.g. 
the possible use of logic programs with stable model seman- 
tics to specify the MRIs, as with database repairs [4, 5, 22]. 

Footnote (on FO queries extended with aggregation, Theorem 7): This is
a proper extension of FO query languages [24, Chapter 8].

We have proposed some efficient algorithms for resolved 
query answering. Implementing them and experimentation 
are also left for future work. Notice that those algorithms 
use different forms of transitive closure. To avoid unaccept- 
ably slow query processing, it may be necessary to compute 
transitive closures off-line and store them. The use of Dat- 
alog with aggregation can be investigated in this direction. 

In this paper we have not considered matching attribute 
values, whenever prescribed by the MDs, using matching 
functions [10]. This element adds an entirely new dimension
to the semantics and the problems investigated here. 
APPENDIX

A. AUXILIARY RESULTS AND PROOFS 

For several of the proofs below, we need some auxiliary 
definitions and results. 

Lemma 3. Let $D$ be an instance and let $m$ be the MD

$R[A] \approx S[B] \rightarrow R[C] \doteq S[E]$

An instance $D'$ obtained by changing modifiable attribute
values of $D$ satisfies $(D, D') \models m$ iff for each equivalence
class of $T_m$, there is a constant vector $\bar{v}$ such that, for all
tuples $t$ in the equivalence class,

$t'[C] = \bar{v}$ if $t \in R(D)$

$t'[E] = \bar{v}$ if $t \in S(D)$

where $t'$ is the tuple in $D'$ with the same identifier as $t$.

Proof. Suppose $(D, D') \models m$. By Definition 3, for each pair
of tuples $t_1 \in R(D)$ and $t_2 \in S(D)$ such that $t_1[A] \approx t_2[B]$,

$t'_1[C] = t'_2[E]$

Therefore, if $T_m(t_1, t_2)$ is true, then $t'_1$ and $t'_2$ must be in
the transitive closure of the binary relation expressed by
$t'_1[C] = t'_2[E]$. But the transitive closure of this relation is
the relation itself (because of the transitivity of equality).
Therefore, $t'_1[C] = t'_2[E]$. The converse is trivial. □
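
The condition of Lemma 3 can be checked mechanically on small instances. The following sketch assumes that tuples of R and S carry distinct identifiers, that each tuple is represented only by its value on the compared attribute, and that similarity is supplied as a Python predicate. The function names, the union-find implementation and the numeric threshold are illustrative assumptions, not part of the paper.

```python
def equivalence_classes(r_tuples, s_tuples, similar):
    """Classes of the transitive closure of t1[A] ~ t2[B], t1 in R, t2 in S.

    `r_tuples` and `s_tuples` map tuple identifiers to the A-value and
    the B-value of the tuple, respectively.
    """
    parent = {tid: tid for tid in list(r_tuples) + list(s_tuples)}

    def find(x):
        # Follow parent links to the representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for rid, a in r_tuples.items():
        for sid, b in s_tuples.items():
            if similar(a, b):
                parent[find(rid)] = find(sid)

    classes = {}
    for tid in parent:
        classes.setdefault(find(tid), set()).add(tid)
    return list(classes.values())

def satisfies_md(classes, updated):
    """Lemma 3 condition: all updated values within a class coincide."""
    return all(len({updated[t] for t in cls}) == 1 for cls in classes)

# Hypothetical data: r1 and r2 are both similar to s1, so all three tuples
# fall into one class although r1 and r2 are not similar to each other.
R = {"r1": 10, "r2": 18}
S = {"s1": 14, "s2": 90}
close = lambda a, b: abs(a - b) <= 5
classes = equivalence_classes(R, S, close)
print(satisfies_md(classes, {"r1": "v", "r2": "v", "s1": "v", "s2": "w"}))
```

Computing the classes once and storing them is one way to avoid recomputing the transitive closure for every query.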
We require the following definitions and lemma. 

Definition 21. Let $S$ be a set and let $S_1, S_2, \ldots, S_n$ be subsets
of $S$ whose union is $S$. A cover subset is a subset $S_i$,
$1 \le i \le n$, that is in a smallest subset of $\{S_1, S_2, \ldots, S_n\}$ whose
union is $S$. The problem Cover Subset (CS) is the problem
of deciding, given a set $S$, a set of subsets $\{S_1, S_2, \ldots, S_n\}$ of
$S$, and a subset $S_i$, $1 \le i \le n$, whether or not $S_i$ is a cover
subset. □

Lemma 4. CS and its complement are NP-hard.

Proof. The proof is by Turing reduction from the minimum
set cover problem, which is NP-complete. Let $O$ be an oracle
for CS. Given an instance of minimum set cover consisting
of set $S$, subsets $S_1, S_2, \ldots, S_n$ of $S$, and integer $k$, the following
algorithm determines whether or not there exists a
cover of $S$ of size $k$ or less. The algorithm queries $O$ on
$(S, \{S_1, \ldots, S_n\}, S_i)$ until a subset $S_i$ is found for which $O$ answers
yes. The algorithm then invokes itself recursively on
the instance consisting of set $S \setminus S_i$, subsets
$\{S_1, \ldots, S_{i-1}, S_{i+1}, \ldots, S_n\}$, and integer $k - 1$. If the input set
in a recursive call is empty, the algorithm halts and returns
yes, and if the input integer is zero but the set is nonempty,
the algorithm halts and returns no. It can be shown using
induction on $k$ that this algorithm returns the correct answer.
This shows that CS is NP-hard. The complement of
CS is hard by a similar proof, with the oracle for CS replaced
by an oracle for the complement of CS. □
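
The recursive algorithm in the proof of Lemma 4 can be sketched as follows. The oracle is replaced here by an exponential brute-force stand-in, used only to make the sketch runnable, and all function names are illustrative. One detail is an assumption of this sketch rather than a statement from the proof: in the recursive call the remaining subsets are restricted to the elements still to be covered, so that their union is again the input set.

```python
from itertools import combinations

def minimum_covers(universe, subsets):
    """Brute-force list of all minimum-size covers, as tuples of indices."""
    for size in range(len(subsets) + 1):
        covers = [c for c in combinations(range(len(subsets)), size)
                  if set().union(*(subsets[i] for i in c)) == universe]
        if covers:
            return covers
    return []

def cover_subset_oracle(universe, subsets, i):
    """Stand-in for the oracle O: is subsets[i] in some minimum cover?"""
    return any(i in c for c in minimum_covers(universe, subsets))

def has_cover_of_size(universe, subsets, k, oracle):
    """Decide whether `universe` has a cover that uses at most k subsets."""
    if not universe:
        return True   # empty input set: answer yes
    if k == 0:
        return False  # nonempty set but no subsets may be used: answer no
    for i in range(len(subsets)):
        if oracle(universe, subsets, i):
            # Remove the chosen subset and the elements it covers.
            rest = [s - subsets[i] for j, s in enumerate(subsets) if j != i]
            return has_cover_of_size(universe - subsets[i], rest, k - 1, oracle)
    return False      # no subset lies in a cover, so no cover exists

# Hypothetical instance whose minimum cover has size 2.
U = {1, 2, 3, 4, 5}
F = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(has_cover_of_size(U, F, 2, cover_subset_oracle))
print(has_cover_of_size(U, F, 1, cover_subset_oracle))
```

Each level of the recursion makes at most n oracle calls and removes one subset, so the number of oracle calls is polynomial in the size of the input.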

Proof of Lemma 1: We assume that an attribute of both
$R$ and $S$ in $RHS(m_1)$ occurs in $LHS(m_2)$. The other cases
are similar. For each L-component of $m_1$, there is an attribute
of $R$ and an attribute of $S$ from that L-component
in $LHS(m_2)$. Let $t_1 \in R$ be a tuple not in a singleton equivalence
class of $T_{m_1}$. Suppose there exist two conjuncts in
$LHS(m_1)$ of the form $A \approx B$ and $C \approx B$. Then it must
hold that there exists $t_2 \in S$ such that $t_1[A] \approx t_2[B]$ and
$t_1[C] \approx t_2[B]$ and by transitivity, $t_1[A] \approx t_1[C]$. More generally,
it follows from induction that $t_1[A] \approx t_1[E]$ for any
pair of attributes $A$ and $E$ of $R$ in the same L-component
of $m_1$.

We now prove that for any pair of tuples $t_1, t_2 \in R$ satisfying
$T_{m_2}(t_1, t_2)$ such that each of $t_1$ and $t_2$ is in a non-singleton
equivalence class of $T_{m_1}$, for any instance $D$ it
holds that $T_{m_1}(t_1, t_2)$. By symmetry, the same result holds
with $R$ replaced with $S$. Suppose for a contradiction that
$T_{m_2}(t_1, t_2)$ but $\neg T_{m_1}(t_1, t_2)$ in $D$. Then it must be true that
$t_1[A] \not\approx t_2[A]$, since, by assumption, there exists a $t_3 \in S$
such that $t_1[A] \approx t_3[B]$, which together with $t_1[A] \approx t_2[A]$
would imply $T_{m_1}(t_1, t_2)$. Therefore, there must be an attribute
$A' \in A$ such that $t_1[A'] \not\approx t_2[A']$, and by the previous
paragraph and transitivity, $t_1[A''] \not\approx t_2[A'']$ for all $A''$ in the
same L-component of $m_1$ as $A'$. By transitivity of $\approx_2$, this
implies $\neg T_{m_2}(t_1, t_2)$, a contradiction.

A resolved instance is obtained in two updates. Let $T_{m_2}$
and $T'_{m_2}$ denote $T_{m_2}$ before and after the first update, respectively.
The first update involves setting the attributes in
$RHS(m_1)$ to a common value for each non-singleton equivalence
class of $T_{m_1}$. The relation $T'_{m_2}$ will depend on these
common values, because of accidental similarities. However,
because of the property proved in the previous paragraph,
this dependence is restricted. Specifically, for each equivalence
class $E$ of $T'_{m_2}$ there is at most one non-singleton
equivalence class $E_1$ of $T_{m_1}$ such that $E$ contains tuples of
$E_1 \cap R$ and at most one non-singleton equivalence class $E_2$
of $T_{m_1}$ such that $E$ contains tuples of $E_2 \cap S$. A given choice
of update values for the first update will result in a set of
sets of tuples from non-singleton equivalence classes of $T_{m_1}$
(ns tuples) that are equivalent under $T'_{m_2}$. Let $K$ be the set
of all such sets of ESs. Clearly, \K\ £ 0{n^), where n is the 
size of the instance. 

Generally, when the instance is updated according to $m_1$,
there will be more than one set of choices of update values
that will lead to the ns tuples being partitioned according
to a given $k \in K$. This is because an equivalence class of
$T'_{m_2}$ will also contain tuples in singleton equivalence classes
of $T_{m_1}$ (s tuples), and the set of such tuples contained in the
equivalence class will depend on the update values chosen
for the modifiable attribute values in the ns tuples in the
equivalence class. For a set $E \in k$, let $E'$ denote the union
over all sets of update values for $E$ of the equivalence classes
of $T'_{m_2}$ that contain $E$ that result from choosing that set of
update values. By transitivity and the result of the second
paragraph, these $E'$ cannot overlap for different $E \in k$.
Therefore, minimization of the change produced by the two
updates can be accomplished by minimizing the change for
each $E'$ separately. Specifically, for each equivalence class $E$,
consider the possible sets of update values for the attributes
in $RHS(m_1)$ for tuples in $E$. Call two such sets of values
equivalent if they result in the same equivalence class $E_1$ of
$T'_{m_2}$. Clearly, there are at most $O(n^c)$ such sets of ESs of
values, where $c$ is the number of R-components of $m_1$. Let
$V$ be a set consisting of one set of values $v$ from each set
of sets of equivalent values. For each set of values $v \in V$,
the minimum number of changes produced by that choice of
value can be determined as follows. The second application
of $m_1$ and $m_2$ updates to a common value each element in
a set $S_2$ of sets of value positions that can be determined
using Lemma 3. The update values that result in minimal
change are easy to determine. Let $S_1$ denote the corresponding
set of sets of value positions for the first update. Since
the second update "overwrites" the first, the net effect of the
first update is to change to a common value the value positions
in each set in $\{S_i \mid S_i = S \setminus \bigcup_{S' \in S_2} S',\ S \in S_1\}$. It
is straightforward to determine the update values that yield
minimal change for each of these sets. This yields the minimum
number of changes for this choice of $v$. Choosing $v$ for
each $E$ so as to minimize the number of changes allows the
minimum number of changes for resolved instances in which
the ns tuples are partitioned according to $k$ to be determined
in $O(n^c)$ time. Repeating this process for all other $k \in K$



allows the determination of the update values that yield an 
MRI in 0(71'^"'"^) time. Since the values to which each value 
in the instance can change in an MRI can be determined in 
polynomial time, the result follows. □ 

Proof of Theorem 2: For simplicity of the presentation,
we make the assumption that the domain of all attributes
is the same. All pairs of distinct values in an instance are
dissimilar. Wlog, we will assume that part (a) of Theorem
1 does not hold. Let $E$ and $L$ denote an ES and an L-component
that violate part (a) of Theorem 1. We prove
the theorem separately for the following three cases: (1)
There exists such an $E$ that contains only attributes of $m_1$,
(2) there exists such an $E$ that contains both attributes not
in $m_1$ and attributes in $m_1$, and (3) (1) and (2) do not hold
(so there exists such an $E$ that contains only attributes not
in $m_1$). Case (1) is divided into two subcases: (1)(a) Only
one R-component of $m_1$ contains attributes of $E$ and (1)(b)
more than one R-component contains attributes of $E$.

Case (1)(a): We reduce the complement of CS (cf. Definition 21),
which is NP-hard by Lemma 4, to this case.
Let $F$ be an instance of CS with set of elements
$U = \{e_1, e_2, \ldots, e_n\}$ and set of subsets $V = \{f_1, f_2, \ldots, f_m\}$.
Wlog, we assume in all cases that each element is contained
in at least two sets. With each subset in $V$ we associate a
value in the set $K = \{k_1, k_2, \ldots, k_m\}$. With each element in
$U$ we associate a value in the set $P = \{v_1, v_2, \ldots, v_n\}$. The
instance will also contain values $b$ and $c$.

Relations $R$ and $S$ each contain a set $S_i$ of tuples for each
$e_i$, $1 \le i \le n$. Specifically, there is a tuple in $S_i$ for each
value in $K$ corresponding to a set to which $e_i$ belongs. On
attributes in $L$, all tuples in $S_i$ take the value $v_i$. There is
one tuple for each value in $K$ corresponding to a set to which
$e_i$ belongs that has that value as the value of all attributes
in the R-component of $m_1$ that contains an attribute in $E$.
On all other attributes, all tuples in all $S_i$ take the value $b$.

Relation $S$ also contains a set $G_1$ of $m$ other tuples. For
each value in $K$, there is a tuple in $G_1$ that takes this value
on all attributes $A$ such that there is an attribute $B \in E$
such that $B \approx A$ occurs in $m_2$. This tuple also takes this
value on some attribute $Z$ of $S$ in $RHS(m_2)$. For all other
attributes, all tuples in $G_1$ take the value $b$.

A resolved instance is obtained in two updates. We first
describe a sequence of updates that will lead to an MRI. It is
easy to verify that the equivalence classes of $T_{m_1}$ are the sets
$S_i$. In the first update, the effect of applying $m_1$ is to update
all modifiable values of attributes in $RHS(m_1)$ within each
equivalence class, which are values of attributes within the
R-component of $m_1$ that contains an attribute of $E$, to a
common value. For some minimum set cover $G$, we choose
as the update value for a given $S_i$ the value associated with
a set in $G$ containing $e_i$.

Before the first update, there is one equivalence class of
$T_{m_2}$ for each value in $K$. Let $E_k$ be the equivalence class
for the value $k \in K$. $E_k$ contains all the tuples in $R$ with
$k$ as the value for the attributes in $E$, as well as a tuple in
$G_1$ with $k$ as the value for $Z$. The only R-component of $m_2$
the values of whose attributes are modifiable for tuples in
$E_k$ is the one containing the attribute $Z$. If $k$ is the value in
$K$ corresponding to a set in the minimum set cover $G$, then
we choose $b$ as the common value for this R-component.
Otherwise, we choose $k$.

After the first update, applying $m_1$ has no effect, since
none of the values of attributes in $RHS(m_1)$ are modifiable.
Each equivalence class of $T_{m_2}$ consists of a set of sets $S_i$ and
a tuple of $G_1$. Specifically, for each update value that was
chosen for the modifiable attributes of $RHS(m_1)$ in the first
update there is an equivalence class that includes the set of
all $S_i$ whose tuples' $RHS(m_1)$ attributes were updated to
that value as well as the tuple of $G_1$ containing this value.
Given the choices of update values in the previous update,
it is easy to see that the values of all attributes in $RHS(m_2)$
for tuples in these equivalence classes are modifiable after
the first update unless all the values are $b$. We choose $b$ as
the update value.

It can easily be seen that, in this update process, the
changes made to values of attributes in $RHS(m_2)$ in the
first update are overwritten by those made in the second
update. Therefore, the total number of changes made in the
two updates is the number $n_1$ of changes made to the values
of attributes in $m_1$ during the first update plus the number
of changes $n_2$ made to the attributes of $m_2$ during the second
update. The only attributes of $m_2$ whose values change to
a value different from the original instance in the second
update are those of attribute $Z$ for tuples in $G_1$. Since these
values change iff they occur within a tuple containing one
of the update values for the $S_i$, $n_2$ is the size of a minimum
set cover.

When $m_1$ is applied to the instance in the first update,
the set of values of attributes in the R-component of $m_1$ that
contains an attribute of $E$ for each set of tuples $S_i$ is updated
to a common value. Before this update, each such set of
values includes the values of the sets to which $e_i$ belongs.
For an arbitrary first update of the instance according to
$m_1$, consider the set $I$ of $S_i$ for which the update value
occurs within the set. We claim that for an MRI the set
of update values for $I$ must correspond to a minimum set
cover for the set of all $e_i$ such that $S_i \in I$. Indeed, if these
values did not correspond to a minimum cover set, then an
instance with fewer changes could be obtained by choosing
them to be a minimum cover set. Furthermore, an update
in which $I$ does not include all $S_i$ cannot produce a resolved
instance with fewer changes than our update process. This
is because, for each $S_i$ not in $I$, at least one additional value
from among the values of attributes in $RHS(m_1)$ for tuples
in $S_i$ was changed relative to our update process. Thus,
the update could be changed so that all $S_i$ are in $I$ without
increasing the number of changes, and the resulting update
would have at least as many changes as one in which the set
of update values corresponds to a minimum set cover. This
implies that a value from $K$ occurs as a value of attribute $Z$
in all MRIs iff the value does not correspond to a cover set.
Thus, RAP is hard for the query $\pi_Z S$.

Case (1)(b): Let $F$ be the min set cover instance from
case (1)(a), and define sets of values $K$ and $P$ as before. In
addition, define a set $Y$ of $2n$ values and values $a$, $c$.

Relations $R$ and $S$ contain a set $S_i$ for each $e_i$, $1 \le i \le n$,
as before. However, these sets now contain one more tuple
than in case (1)(a). On attributes in $L$, tuples in each $S_i$ take
the same value as in case (1)(a). Let $\{k'_1, k'_2, \ldots, k'_{|S_i|}\}$ and
$\{k''_1, k''_2, \ldots, k''_{|S_i|}\}$ be lists of all the values in $K$ corresponding
to sets to which $e_i$ belongs such that $k'_j = k''_{(j \bmod |S_i|)+1}$. For
some R-component of $m_1$ containing an attribute of $E$, for
each $1 \le j \le |S_i|$, there is a tuple in $S_i$ that takes the value
$k'_j$ on all attributes in this component and the value $k''_j$ on
all attributes of all other R-components of $m_1$ containing
attributes of $E$. (We do this to ensure that all tuples in
all $S_i$ are in singleton equivalence classes of $T_{m_2}$ before the
first update, and so their values are not updated by the
application of $m_2$ in this update.) There is also a tuple that
takes the value $a$ on all attributes of all R-components of
$m_1$ containing attributes of $E$. On all other attributes, all
tuples in all $S_i$ take the value $b$.

Relation $R$ also contains a set $G_1$ of $2n$ other tuples. For
each value in $Y$, there is a tuple in $G_1$ with that value as
the value of all attributes of $R$ in $L$. There are $2n$ tuples
with value $a$ for all attributes in $E$. For all attributes of $R$
in $RHS(m_2)$, all tuples in $G_1$ take the value $c$. On all other
attributes, tuples in $G_1$ take the value $b$.

Relation $S$ also contains a set $G_2$ of $m+1$ other tuples. For
each value in $K$, there is a tuple in $G_2$ that takes this value
on all attributes $A$ such that there is an attribute $B \in E$
such that $B \approx A$ occurs in $m_2$. This tuple also takes this
value on some attribute $Z$ of $S$ in $RHS(m_2)$. There is also
a tuple $t_1$ which takes the value $a$ on all attributes $A$ such
that there is an attribute $B \in E$ such that $B \approx A$ occurs in
$m_2$, and the value $c$ on $Z$. For all other attributes, all tuples
in $G_2$ take the value $b$ except $t_1$ which takes the value $c$.

As in case (1)(a), a resolved instance is obtained in two
updates. We now describe a series of updates that leads to
an MRI. The equivalence classes of $T_{m_1}$ are the sets $S_i$ as
before. The sets of modifiable values in $RHS(m_1)$ are the
sets of values of tuples in $S_i$ for attributes in an R-component
of $m_1$ that contains an attribute of $E$. We again choose the
update values to correspond to a minimum set cover, and
we choose the same update value for all R-components for a
given $S_i$.

Before the first update, there is one equivalence class of
$T_{m_2}$ containing all tuples that have value $a$ for attributes in
$E$. The values of all attributes in $RHS(m_2)$ are modifiable
for tuples in this equivalence class. We choose $c$ as the common
value. After the first update, the equivalence classes of
$T_{m_2}$ are as in case (1)(a), and we choose the same update
values as before.

As in case (1)(a), the changes made to values of attributes
in $RHS(m_2)$ in the first update are overwritten by those
made in the second update. As in that case, this implies
that the total number of changes is the number of changes
made to the attributes of $m_1$ during the first update plus
the number of subsets in a minimum set cover.

If the update value chosen for the $RHS(m_2)$ attributes
of the equivalence class of $T_{m_2}$ in the first update is not $c$,
the resulting resolved instance cannot be an MRI. Indeed,
suppose that there is a different value that can be used to
obtain an MRI. If this value is chosen, then the number of
changes to the values of attributes of $RHS(m_2)$ for tuples in
$G_1$ resulting from the update is at least $2n$. Since our update
process makes at most $n$ changes to these values and the
minimum number of changes to the values of attributes of
$RHS(m_1)$, this implies that these values must be modifiable
after the first update so that they can be changed back to
their original value in the second update. Modifiability can
only be achieved by updating the values of attributes in
$RHS(m_1)$ to $a$ for some $S_i$ in the first update. However,
this would result in at least 3 changes to values in tuples
in $S_i$ in the second update, since these tuples would then
be in the same equivalence class of $T_{m_2}$ as the tuples in
$G_1$. Because other choices of update values for $S_i$ in the
first update result in only 1 change, this cannot produce an
MRI. In fact, this shows that, even if the first update using
$m_2$ is kept the same as in our update process, using $a$ as the
update value for the $RHS(m_1)$ attributes of $S_i$ in the first
update will not produce an MRI.

When $m_1$ is applied to the instance in the first update,
the set of values for the attributes in an R-component of $m_1$
for a given $S_i$ are updated to a common value. Suppose that
for each R-component, the update value is a value in $K$ that
is in the set, and the update values for the R-components
are not all the same. It is straightforward to show that
this implies that all the tuples in $S_i$ will be in singleton
equivalence classes of $T_{m_2}$ after the first update, and so will
not be changed in the second update. As we have shown, for
any update process leading to an MRI, at least one change
must be made to the values of attributes in $RHS(m_2)$ for
tuples in $S_i$ during the first update. Since these changes are
undone in our update process, the number of updates to the
tuples in $S_i$ is at least one greater than in our update process.
The result now follows from exactly the same argument used
in case (1)(a), except with the additional requirement for
$S_i$ in $I$ that their update values are the same for all R-components
of $m_1$.

Case (2): For simplicity of the presentation, we will assume
that there exists only one attribute $A$ in $E$ not in $m_1$.
Let $F$ be the min set cover instance from case (1)(a), and
define sets of values $K$ and $P$ as before. In addition, define
$m$ sets $Y_i$, $1 \le i \le m$, of $2n$ values and values $a$, $b$, and $c$.

Relations $R$ and $S$ contain a set $S_i$ for each $e_i$, $1 \le i \le n$,
as before. However, $S_i$ now contains two tuples
for each set to which $e_i$ belongs. On attributes in $L$, tuples
in each $S_i$ take the same value as in case (1)(a). Let
$K' = \{k'_1, k'_2, \ldots, k'_{|S_i|}\}$ and $K'' = \{k''_1, k''_2, \ldots, k''_{|S_i|}\}$ be lists as
defined in case (1)(b). For each value $k'_j \in K'$, there are
two tuples in $S_i$ that take this value on all attributes in all
R-components of $m_1$ containing an attribute of $E$. On the
attribute $A$, one of these tuples takes the value $k'_j$ and the
other takes the value $k''_j$. On all other attributes, all tuples
in all $S_i$ take the value $b$.

Relation $R$ also contains a set $G_1$ of $4nm$ other tuples.
For each value in each $Y_i$, $1 \le i \le m$, there are two tuples
$t_1$ and $t_2$ in $G_1$ with that value as the value of all attributes
of $R$ in $L$. Tuple $t_1$ takes the value $a$ for all attributes in $E$
except $A$, and $t_2$ takes the value in $V$ corresponding to $S_i$
on these attributes. For all attributes of $R$ in $RHS(m_2)$, $t_1$
takes the value $c$ and $t_2$ takes the value in $V$ corresponding
to $S_i$. On attribute $A$, both tuples take the value in $V$ that
corresponds to $S_i$. On all other attributes, tuples in $G_1$ take
the value $b$.

Relation $S$ also contains a set of tuples $G_2$ containing
$2nm$ tuples. For each value in each $Y_i$, $1 \le i \le m$, there
is a tuple in $G_2$ that takes the value on all attributes in $L$.
On all attributes in all R-components of $m_1$ that contain
an attribute of $E$, tuples in $G_2$ take the value $a$. For all
attributes of $S$ in $RHS(m_2)$, all tuples in $G_2$ take the value
$c$. On all other attributes, tuples in $G_2$ take the value $b$.

Relation $S$ also contains a set of tuples $G_3$ containing $m$
tuples. For each value in $K$, there is a tuple in $G_3$ that takes
this value on all attributes $A$ such that there is an attribute
$B \in E$ such that $B \approx A$ occurs in $m_2$. The tuple also takes
this value on some attribute $Z$ of $S$ in $RHS(m_2)$. For all
other attributes, all tuples in $G_3$ take the value $b$.

As in case (1), a resolved instance is obtained in two updates.
We now describe a series of updates that leads to
an MRI. The equivalence classes of $T_{m_1}$ are the sets $S_i$, as
well as $2nm$ sets of 3 tuples, two from $G_1$ and one from $G_2$
that take the same value on attributes in $L$. For the $S_i$,
we choose the update values for attributes in $RHS(m_1)$ in
the same way as in case (1)(b). For the other equivalence
classes, we choose the update value $a$.

Before the first update, the only equivalence classes of
$T_{m_2}$ such that the $RHS(m_2)$ attribute values are modifiable
are those containing tuples from the sets $S_i$. Each of these
equivalence classes includes tuples in $S_i$ that take a given
value $v$ from $V$ on all attributes in $E$ (including $A$), as well
as those tuples of $G_1$ that take the value $v$ on these attributes
and the tuple from $G_3$ that contains this value. Call such
an equivalence class $E_v$. We choose $v$ as the update value
for each $E_v$.

After the first update, the equivalence classes of $T_{m_2}$ are
similar to those in case (1). As in that case, we choose
update values in the second update so as to overwrite the
changes made to values of attributes in $RHS(m_2)$ in the
first update. This implies that the total number of changes
is the number of changes made to the attributes of $m_1$ during
the first update plus the number of subsets in a minimum
set cover.

We now show that, as in case (1), the value in a tuple in
$G_3$ that corresponds to a given set in $V$ changes in some MRI
iff that set is in a min set cover. Consider the first update
produced by the application of $m_2$. Suppose that the update
value for an equivalence class $E_v$ is not $v$, and assume for a
contradiction that this leads to an MRI. This update would
result in at least $2n$ changes in the values of tuples in $G_1$,
and thus would produce at least $n$ more changes than the
maximum number of changes that our update process could
produce. Therefore, at least some of the values of tuples in
$G_1$ in this equivalence class must be modifiable after the first
update, so that they can be restored to their original values.
This implies that, in the update produced by $m_1$, the update
value chosen for any such modifiable tuple cannot be $a$, or
it would be in a singleton equivalence class of $T_{m_2}$ after the
update. However, not choosing $a$ as the update value would
result in at least one more change relative to our update
process. This is because the updated values include at least
one more $a$ than any other value. Thus, the first update
value for the equivalence classes of $T_{m_2}$ must be chosen as
in our update process in order to obtain an MRI.

Consider the update resulting from the application of mi. 
If an update to an equivalence class involving tuples of Gi 
and G2 does not use the value a, then the resolved instance 
obtained cannot be an MRI. This is because using any other 
choice of value would result in at least one more change 
in these tuples relative to our update process in the first 
update, and cannot result in fewer updates in the second 
update since choosing a makes the values in tuples in the 
equivalence class unmodifiable. The result now follows from 
an argument similar to that of case (1). 

Case (3): Let $F$ be the CS instance from case (1)(a), and
define sets of values $K$ and $P$ as before. Let $E'$ be an ES
containing attributes of $m_1$. Since the MDs are interacting,
there must be at least one such ES, and by assumption, it
must contain an attribute of $LHS(m_1)$. Let $C_1$ denote some
R-component of $m_1$ that contains an attribute of $E'$, and
let $p$ denote the number of attributes in $C_1$. Let $C_2$ denote
some R-component of $m_2$. Let $q$ be the number of attributes
of $R$ in $C_2$. We define a set $W$ of values of size $p^2$, and $mn$
sets $Y_{ij}$, $1 \le i \le m$, $1 \le j \le n$, of $p + q$ elements each. We
also define a value $a$.

Relations $R$ and $S$ contain a set $S_i$ for each set $f_i$, $1 \le i \le m$,
in $V$. For each element $e_j$ in $f_i$, $S_i$ contains a set $S_{ij}$
of $p + q$ tuples. On all attributes of $L$, all tuples in $S_i$ take
the value $k_i$ in $K$ corresponding to $f_i$. For any given $S_{ij}$, for
a set of $p$ tuples in $S_{ij}$, each value in $W$ occurs once as the
value of an attribute in $C_1$ for a tuple in the set. All other
tuples in $S_{ij}$ take the value $a$ on all attributes in $C_1$. For
each value in $Y_{ij}$, there is a tuple in $S_{ij}$ that takes the value
on all attributes in $C_2$. On all attributes of $E$, each tuple
in $S_{ij}$, $1 \le i \le m$, takes the value $v_j$ in $P$ that is associated
with $e_j$. On all other attributes, all tuples in $S_i$ take the
value $a$.

Relation $S$ also contains a set of tuples $G_1$. For each
pair $(f_i, e_j) \in V \times U$, there is a set of tuples $X_{ij}$ in $G_1$
of size $p + q$. For all attributes of $S$ in the L-component
containing the attributes of $E$, each $X_{ij}$ takes the value $v_j$
in $P$ associated with $e_j$. For each value in $Y_{ij}$, there is a
tuple in $X_{ij}$ that takes this value on all attributes of $C_2$.
On all other attributes, all tuples in $G_1$ take the value $a$.

A resolved instance is obtained in two updates. The
equivalence classes of T_{m_1} are the sets S_i. The effect of
the first update is to change all values of all attributes in
C_1 for tuples in S_i to a common value. It is easy to see that
if the update value is not a, then all tuples in S_i will be in
singleton equivalence classes of T_{m_2} after the update. Thus,
the equivalence classes of T_{m_2} after the update are
∪_{i ∈ J} S_ij, 1 ≤ j ≤ n, where J = {i | a was chosen as the
update value for S_i}. If the update value a is chosen for S_i
for some i, we say that S_i is unblocked. Otherwise, it is
blocked.

Consider a blocked S_i. In the first update, the minimum
number of changes to values for attributes in RHS(m_1) is
p(p + q)k − 1, where k is the number of elements in f_i.
Minimal change of the values of attributes in C_2 for tuples in
an equivalence class of T_{m_2} is achieved by updating to one
of the original values. The number of changes to values of
attributes in RHS(m_2) for tuples in S_i depends on the number
of sets S_ij that are contained in S_i that contain the tuple
with this update value. The greater this number, the fewer
the changes. We will take this into account later, but we
ignore it for now and assume that the values of attributes of
RHS(m_2) are updated to values outside the active domain
in the first update. Under this assumption, the resulting
upper bound on the number of changes is q²k + d(p + q)k,
where d is the number of attributes of S in C_2. Since all
tuples in S_i are in singleton equivalence classes of T_{m_2}
after the first update, the second update produces no further
changes. Therefore, the number of changes of values for tuples
in S_i is at most p(p + q)k − 1 + q²k + d(p + q)k.

For an unblocked S_i, the minimum number of changes to
values for attributes in RHS(m_1) is p²k. Since the second
update "overwrites" the first, the number of changes to the
values of attributes in RHS(m_2) is the number of changes
produced in the second update. Minimal change of the values
of attributes in C_2 for tuples in an equivalence class of
T_{m_2} is achieved by updating to one of the original values
for these tuples and attributes. A set S_ij is good if all
values in the set of values of attributes in C_2 for tuples in
S_ij are modified to a value in the set in the second update.
A set S_i is good if it contains a good S_ij. Sets S_ij and S_i
that are not good are bad. The number of changes to attributes
of RHS(m_2) for a bad unblocked S_i is q(p + q)k + d(p + q)k,
and for a good unblocked S_i it is at most
q(p + q)k + d(p + q)k − (q + d). Thus the total number of
changes for the bad and good cases is
p²k + q(p + q)k + d(p + q)k and at most
p²k + q(p + q)k + d(p + q)k − (q + d), respectively. If the
upper bound on the number of changes from the previous
paragraph is taken as the number of changes for blocked S_i,
it is easy to verify that for a given good (bad) S_i, the
number of changes when S_i is unblocked (blocked) is strictly
less than the number of changes when S_i is blocked (unblocked).
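
The comparison between blocked and unblocked change counts can be checked numerically. The following is a minimal Python sketch, not part of the original proof. It assumes p, q, d, k ≥ 1 and reads the exponents in the bounds as p²k and q²k; the function names are illustrative only.

```python
def blocked_upper(p, q, d, k):
    """Upper bound on changes for a blocked S_i (values for RHS(m_2)
    assumed to be taken from outside the active domain)."""
    return p * (p + q) * k - 1 + q * q * k + d * (p + q) * k


def unblocked_bad(p, q, d, k):
    """Number of changes for a bad unblocked S_i."""
    return p * p * k + q * (p + q) * k + d * (p + q) * k


def unblocked_good_upper(p, q, d, k):
    """Upper bound on changes for a good unblocked S_i."""
    return unblocked_bad(p, q, d, k) - (q + d)


def check(limit=6):
    """Verify both strict inequalities for all 1 <= p, q, d, k < limit."""
    for p in range(1, limit):
        for q in range(1, limit):
            for d in range(1, limit):
                for k in range(1, limit):
                    blocked = blocked_upper(p, q, d, k)
                    # A bad S_i is cheaper when blocked.
                    assert blocked < unblocked_bad(p, q, d, k)
                    # A good S_i is cheaper when unblocked.
                    assert unblocked_good_upper(p, q, d, k) < blocked
    return True
```

Expanding the expressions shows why the check succeeds: the blocked bound is exactly one less than the bad unblocked count, and the good unblocked bound is q + d − 1 ≥ 1 less than the blocked bound.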

Consider a sequence I of two updates in which all S_i are
chosen to be unblocked in the first update. Assume that
all sets of values that must be updated to a common value
are updated to a value in the set, except the values of
attributes in RHS(m_2) in the first update. We now show how
to improve this pair of updates in order to obtain a pair of
updates leading to an MRI. For each j, there is exactly one
i such that S_ij is good. Since all values of the attributes in
C_2 occur with the same frequency, the number of changes
resulting from the two updates does not depend on which
S_ij are chosen to be good. The number of changes resulting
from applying I to the instance is reduced by changing all
bad S_i to blocked. This improvement is maximized by
maximizing the number of bad S_i, which can be accomplished
by choosing the set of good S_i so that it corresponds to a
minimum set cover. Denote by I' the pair of updates
obtained by changing I so that it conforms to this choice of
good S_i and by changing all the resulting bad S_i to blocked.
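
To illustrate the choice of good and bad sets, the following minimal Python sketch finds a minimum set cover by brute force on a toy input. It is an illustration only: the names `minimum_set_cover`, `f1`, `e1`, and so on are hypothetical, and the exhaustive search is exponential, in line with the hardness that the reduction relies on.

```python
from itertools import combinations


def minimum_set_cover(universe, subsets):
    """Return the names of a smallest family of subsets covering universe.

    subsets maps a name (standing for f_i) to a set of elements (the e_j).
    Returns None if no cover exists.
    """
    universe = set(universe)
    names = sorted(subsets)
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            covered = set().union(*(subsets[name] for name in combo))
            if universe <= covered:
                return list(combo)
    return None


# Toy instance: the cover sets play the role of the good S_i,
# and every remaining S_i is bad and is changed to blocked.
sets_v = {"f1": {"e1", "e2"}, "f2": {"e2", "e3"}, "f3": {"e3"}}
good = minimum_set_cover({"e1", "e2", "e3"}, sets_v)
blocked = [name for name in sorted(sets_v) if name not in good]
```

In this toy instance `good` contains two set names and `blocked` contains the remaining one.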

We now remove the assumption that values from outside
the active domain are used as update values for attributes in
C_2 in the first update. This has no effect on the number of
changes for tuples in unblocked S_i, since the first update is
"overwritten" for these tuples. However, if the update value
for a given equivalence class of T_{m_2} is chosen as one of the
values of a tuple in a blocked S_i, it reduces the number of
changes. Let I" be the sequence of updates obtained by
modifying I' so that each update value for an equivalence
class of T_{m_2} in the first update is chosen from among the
values of tuples in the equivalence class that are in a blocked
S_i. It is easy to verify that any I" obtained in this way
produces an MRI, and that no other update process will
produce an MRI. Hardness of the pair of MDs now follows
from the fact that the only values that are unchanged in
all MRIs among the values of attributes in C_2 are values in
those S_i that correspond to cover sets. □

Proof of Proposition 2: We prove the proposition for
HSC sets. In the proof, for an MD m, we use the term
transitive closure of m, denoted T_m, to refer to the transitive
closure of the binary relation that relates pairs of tuples
satisfying the similarity condition of m. For a set of MDs
M, the transitive closure of M, denoted T_M, is the union of
the transitive closures of the MDs in M.
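
As an illustration of the transitive closure of a single MD, the following minimal Python sketch computes its equivalence classes with a union-find structure. It assumes that the similarity relation is reflexive and symmetric, so that its transitive closure is an equivalence relation. The function name and the input format (a list of tuple identifiers and a list of similar pairs) are assumptions made for the example.

```python
def equivalence_classes(tuple_ids, similar_pairs):
    """Equivalence classes of the transitive closure of a similarity relation.

    tuple_ids: iterable of hashable, sortable tuple identifiers.
    similar_pairs: pairs (a, b) of identifiers whose tuples satisfy
    the similarity condition of the MD.
    """
    tuple_ids = list(tuple_ids)
    parent = {t: t for t in tuple_ids}

    def find(t):
        # Follow parent links to the representative, halving the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = {}
    for t in tuple_ids:
        classes.setdefault(find(t), set()).add(t)
    return sorted(sorted(members) for members in classes.values())
```

For example, with identifiers 1 to 5 and similar pairs (1, 2) and (2, 3), tuples 1 and 3 end up in the same class even though they are not directly similar.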

Consider an instance D and set of matching dependencies
M. Consider an MD m of the form

R[Ā] ≈ R[Ā] → R[B̄] ≐ R[B̄]

Let L be the set of all lengths of cycles on the vertices
corresponding to the MDs in PS(m). Let n = LCM(L) be
the period of m. It is easy to see that there exists a set
{S_1, S_2, ..., S_n} of subsets of PS(m) with transitive closures
{T_1, T_2, ..., T_n}, where ∪_i S_i = PS(m), such that the
following holds. Let D_i denote an instance obtained by updating
D i times according to M, and for a tuple t ∈ D, denote
the tuple with the same identifier in D_i by t^i. Let (B, B) be
a corresponding pair of (B̄, B̄). After D has been updated
i + a times, for a sufficiently large, according to M to
obtain an instance D_{i+a}, for all tuples t in a given
equivalence class E of T_i,

t^{i+a}[B] = t^{i+a}[B] = v_E^i (6)

for some value v_E^i. Let D' be a resolved instance. D' satisfies
the property that any number of applications of the MDs
does not change the instance. Therefore, D' must satisfy
(6) for all i. That is, for all 1 ≤ i ≤ n, for any equivalence
class E of T_i, and for all tuples t in E,

t'[B] = t'[B] = v_E^i (7)

where t' is the tuple in D' with the same identifier as t.

By (7), for any pair of tuples t_1 and t_2 satisfying
T_{PS(m)}(t_1, t_2), t'_1 and t'_2 must satisfy T'(t'_1, t'_2),
where T' is the transitive closure of the binary relation on
tuples expressed by t'_1[B] = t'_2[B]. Since the equality
relation is closed under transitive closure, this implies the
following property:

T_{PS(m)}(t_1, t_2) implies t'_1[B] = t'_2[B] (8)

Equation (8) implies that the attribute values for the tu- 
ple/attribute pairs specified in the proposition must be equal 
in a resolved instance. By specifying a series of updates such 
that only these values are changed, we now show that these 
are the only changed values in an MRI. 

D is updated as follows. For sufficiently large a, after each
update attribute B must satisfy an equation of the form of
(6) for each m for which B ∈ RHS(m). Let T be the
transitive closure of the set of all T_{PS(m)} such that
B ∈ RHS(m). For the (i + a)th update, if the values of B must
be modified to enforce (6), use as the common value for all
equivalence classes E contained within a given equivalence
class of T the most frequently occurring value for B in this
equivalence class of T. If there is more than one most
frequently occurring value, choose any such value. After a
finite number of updates, an instance is obtained that
satisfies (8).
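
The choice of update value can be sketched in Python as follows. This is a minimal illustration, assuming the equivalence classes of T are already known and that the values of B are given as a dictionary from tuple identifiers to values. The names are hypothetical, and the tie-break (smallest value) is an arbitrary choice, since any most frequent value is permitted.

```python
from collections import Counter


def merge_to_most_frequent(values, classes):
    """Set B to a most frequent value within each equivalence class of T.

    values: dict mapping tuple identifier -> current value of B.
    classes: iterable of collections of tuple identifiers.
    """
    updated = dict(values)
    for members in classes:
        counts = Counter(values[t] for t in members)
        top = max(counts.values())
        # Any most frequent value may be used; min() makes the sketch deterministic.
        chosen = min(v for v, c in counts.items() if c == top)
        for t in members:
            updated[t] = chosen
    return updated


def count_changes(before, after):
    """Number of tuples whose value of B differs between two instances."""
    return sum(1 for t in before if before[t] != after[t])
```

Choosing a most frequent value leaves the largest possible number of values in the class unchanged, which is the minimality argument used in the remainder of the proof.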

We must show that this update process does not change
any values other than those that must be changed to satisfy
(8). The theorem will then follow from the fact that the
fewest possible values were changed in order to enforce (8).
Let {T_1, T_2, ..., T_|M|} denote the set of transitive closures
of the MDs {m_1, m_2, ..., m_|M|} in M. For any intermediate
instance I obtained in the update process, let t_I denote the
tuple in I with the same identifier as t in the original
instance. We will show by induction on the number of updates
that were made to obtain I that for any j, whenever
T_j(t_I, t'_I) for tuples t and t', it holds that T(t, t'). This
implies that updates made to t[A] for any tuple t and attribute
A can only set it equal to the common value for the equivalence
class of T to which t belongs.

By definition of T, if no updates were used to obtain I,
T_j(t_I, t'_I) implies T_j(t, t') implies T(t, t'). Assume it is
true for instances obtained after at most k updates. Let I be
an instance obtained after k + 1 updates. Consider the MD

m_j : R[Ā] ≈_j R[Ā] → R[B̄] ≐ R[B̄]

Suppose for the sake of contradiction that there exist tuples
t_I and t'_I such that T_j(t_I, t'_I) but ¬T(t, t'). Let I' be the

^^We use the term "update" even if a resolved instance is 
obtained after fewer than i modifications. In this case, the 
"update" is the identity mapping on all values. 



instance of which I is the updated instance. Then, there
must be a set of tuples U = {t^0, ..., t^p} with t^0 = t and
t^p = t' such that t_I^{i−1}[Ā] ≈_j t_I^i[Ā] for all 1 ≤ i ≤ p.
By choice of update value, for all i, T(t^{i−1}, s^{i−1}) and
T(t^i, s^i), where s^{i−1} and s^i are tuples such that
s_{I'}^{i−1}[Ā] = t_I^{i−1}[Ā] and s_{I'}^i[Ā] = t_I^i[Ā]. By
s_{I'}^{i−1}[Ā] ≈_j s_{I'}^i[Ā] and the induction hypothesis,
T(s^{i−1}, s^i). By transitivity, this implies T(t^{i−1}, t^i) for
all i, which implies T(t, t'), a contradiction. □

Proof of Theorem 5: We express the query in the form

Q(ȳ) = ∃z̄ Q_1(z̄, ȳ) (9)

Let x_ij denote the variable of z̄ or ȳ which holds the value of
the jth attribute in the ith conjunct R_i in Q_1. Denote this
attribute by A_ij. Note that, since variables and conjuncts
can be repeated, it can happen that x_ij is the same variable
as x_kl for (i, j) ≠ (k, l), that A_ij is the same attribute as
A_kl for (i, j) ≠ (k, l), or that R_i is the same as R_j for
i ≠ j. Let B and F denote the set of bound and free variables
in Q_1, respectively. Let C and U denote the variables in Q_1
holding the values of changeable and unchangeable attributes,
respectively. Let Q'(ȳ) denote the rewritten query returned
by algorithm Rewrite, which we express as

Q'(ȳ) = ∃z̄ Q'_1(z̄, ȳ)

We show that, for any constant vector ā, Q'(ā) is true for
an instance D iff Q(ā) is true for all MRIs of D.

Suppose that Q'(ā) is true for an instance D. Then there
exists a b̄ such that Q'_1(b̄, ā). We will refer to this
assignment of constants to variables as A_{Q'}. From the form of
Q', it is apparent that, for any fixed i, there is a tuple
t_1 = c̄_i = (c_i1, c_i2, ..., c_ip) such that R_i(c̄_i) is true
in D with the following properties.

1. For all x_ij except those in F ∩ C, c_ij is the value
assigned to x_ij by A_{Q'}.

2. For all x_ij ∈ F ∩ C, there is a tuple t_2 with attribute
B such that Dup(t_1, A_ij, t_2, B), and the value of t_2[B]
is the value assigned to x_ij by A_{Q'}. Moreover, this
value occurs more frequently than that of any other
tuple/attribute pair in the same equivalence class of
Dup.

For any given MRI D', consider the tuple t'_1 in D' with the
same identifier as t_1. Clearly, this tuple will have the same
values as t_1 for all unchangeable attributes, which by 1., are
the values assigned to the variables x_ij ∈ U. Also, by 2.
and Corollary 3, for any j such that x_ij ∈ F ∩ C is free,
the value of the jth attribute of t'_1 is that assigned to x_ij
by A_{Q'}.

Thus, for each MRI D', there exists an assignment A_Q of
constants to the x_ij that makes Q true, and this assignment
agrees with A_{Q'} on all x_ij ∉ B ∩ C. This assignment is
consistent in the sense that, if x_ij and x_kl are the same
variable, they are assigned the same value. Indeed, for
x_ij ∉ B ∩ C, consistency follows from the consistency of
A_{Q'}, and for x_ij ∈ B ∩ C, it follows from the fact that the
variable represented by x_ij occurs only once in Q, by
assumption. Therefore, Q(ā) is true for all MRIs D', and ā is
a resolved answer.

Conversely, suppose that a tuple ā is a resolved answer.
Then, for any given MRI D' there is a satisfying assignment
A_Q to the variables in Q such that ȳ as defined by (9) is
assigned the value ā. We write Q' in the form

Q'(ȳ) = ∃z̄ ⋀_{1 ≤ i ≤ n} Q'_i(v̄_i) (10)

with Q'_i the rewritten form of the ith conjunct of Q. For any
fixed i, let t' = (c'_i1, c'_i2, ..., c'_ip) be a tuple in D' such
that c'_ij is the constant assigned to x_ij by A_Q.

We construct a satisfying assignment A_{Q'} to the free and
existentially quantified variables of Q' as follows. Consider
the conjunct Q'_i of Q' as given on line 17 of Rewrite. Assign
to v̄_i the tuple t in D with the same identifier as t'. This
fixes the values of all the variables except those x_ij ∈ F ∩ C,
which are set to c'_ij. It follows from Lemma 3 that A_{Q'}
satisfies Q'. Since A_Q and A_{Q'} match on all variables that
are not local to a single Q'_i, A_{Q'} is consistent. Therefore,
ā is an answer for Q' on D. □

Proof of Theorem 6: Hardness follows from the fact that, 
for the instance D resulting from the reduction in the proof 
of Theorem 3.3 in [13], the set of all repairs of D with respect 
to the given key constraint is the same as the set of MRIs 
with respect to m. The key point is that attribute modifica- 
tion in this case generates duplicates which are subsequently 
eliminated from the instance, producing the same result as 
tuple deletion. Containment is easy. □ 

Proof of Proposition 3: Take Ā = (A_1, ..., A_m) and
B̄ = (B_1, ..., B_n). For any tuple of constants k̄, define
R^k̄ = σ_{Ā = k̄} R. Let B_i^k̄ denote the single attribute
relation with attribute B_i whose tuples are the most frequently
occurring values in π_{B_i} R^k̄. That is, a ∈ B_i^k̄ iff
a ∈ π_{B_i} R^k̄ and there is no b ∈ π_{B_i} R^k̄ such that b
occurs as the value of the B_i attribute in more tuples of R^k̄
than a does. Note that B_i^k̄ can be written as an expression
involving R which is first order with a Count operator. The
reduction produces (R', t) from (R, t), where

R' = ∪_k̄ [π_Ā R^k̄ × B_1^k̄ × ⋯ × B_n^k̄] (11)

The repairs of R' are obtained by keeping, for each set of
tuples with the same key value, a single tuple with that key
value and discarding all others. By Lemma 3, in an MRI of D,
the group G_k̄ of tuples such that Ā = k̄ for some constant
k̄ has a common value for B̄ also, and the set of possible
values for B̄ is the same as that of the tuple with key k̄ in a
repair of D. Since duplicates are eliminated from the MRIs,
the set of MRIs of D is exactly the set of repairs of R'. □

Proof of Theorem 7: Q'' is obtained by composing Q'
with the transformation R ↦ R', which is a first-order query
with aggregation. □
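
The transformation from R to R' used in the proof of Proposition 3 groups tuples by their value for Ā and combines each key value with the most frequent values of each B_i in its group. The following is a minimal Python sketch of that construction, assuming R is given as a list of Python tuples whose first `key_len` positions hold Ā. The function name and representation are illustrative and do not come from the paper.

```python
from collections import Counter
from itertools import product


def build_r_prime(rows, key_len):
    """Build R' from R as in equation (11).

    rows: list of tuples; positions [0, key_len) hold the attributes of A,
    the remaining positions hold B_1, ..., B_n.
    Returns a set of tuples.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[:key_len], []).append(row[key_len:])

    result = set()
    for key, rest in groups.items():
        top_values = []
        for column in zip(*rest):
            counts = Counter(column)
            best = max(counts.values())
            # All most frequent values of this B_i within the group.
            top_values.append([v for v, c in counts.items() if c == best])
        # Cartesian product of the key with the per-attribute top values.
        for combo in product(*top_values):
            result.add(key + combo)
    return result
```

When a group has several most frequent values for some B_i, R' contains several tuples with the same key value, and each repair of R' keeps exactly one of them.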